# Full text of "Population 1"

Daniel L. Hartl, Harvard University

Andrew G. Clark, Pennsylvania State University

Principles of Population Genetics, THIRD EDITION

Sinauer Associates, Inc. Publishers, Sunderland, Massachusetts

THE COVER

This image represents data from the first study of nucleotide sequence variation in a natural population, conducted by Martin Kreitman (1983, Nature 304: 412-417). Each of the 11 vertical bands represents the 43 varying nucleotide sites in one allele of the gene alcohol dehydrogenase taken from a global distribution of the fruit fly Drosophila melanogaster. The colors correspond to the bases at each site: adenine = green, cytosine = yellow, guanine = blue, and thymine = red. Note that at each position only two different bases were observed. Blocks of sites in strong linkage disequilibrium can also be seen as repeated patterns of color. The sequences are oriented with the 5' end at the top, and the nucleotide corresponding to the fast/slow difference in the ADH protein is the twelfth one from the bottom.

To Barbara and Christine

PRINCIPLES OF POPULATION GENETICS, Third Edition

Copyright © 1997 by Sinauer Associates, Inc. All rights reserved. This book may not be reproduced in whole or in part without permission from the publisher.

For information or to order, address Sinauer Associates, Inc., P.O. Box 407, 23 Plumtree Road, Sunderland, MA 01375 U.S.A. FAX: 413-549-1118. Internet: publish@sinauer.com; http://www.sinauer.com

Library of Congress Cataloging-in-Publication Data

Hartl, Daniel L. Principles of population genetics / Daniel L. Hartl, Andrew G. Clark. 3rd ed. p. cm. Includes bibliographical references and index. ISBN 0-87893-306-9 (hardcover). 1. Population genetics. 2. Quantitative genetics. 3. Population genetics -- Problems, exercises, etc.
I. Clark, Andrew G., 1954- . II. Title. QH455.H36 1997 576.5'8--dc21 97-34505 CIP

Printed in Canada 5 4 3 2 1

## Table of Contents

Preface

1. GENETIC AND STATISTICAL BACKGROUND: Gene Expression and Gene Interaction; Gene Expression; The Genetic Code; Alleles; Genotype and Phenotype; Dominance and Gene Interaction; Segregation and Recombination; Probability in Population Genetics; The Addition Rule; The Multiplication Rule; Repeated Trials; Phenotypic Diversity and Genetic Variation; Allele Frequencies in Populations; Parameters and Estimates; The Standard Error of an Estimate; Models in Population Genetics; Exponential Population Growth; Logistic Population Growth; Summary; Problems

2. GENETIC AND PHENOTYPIC VARIATION: Phenotypic Variation in Natural Populations; Continuous Variation: The Normal Distribution; Mean and Variance; Central Limit Theorem; Discrete Mendelian Variation; Experimental Methods for Detecting Genetic Variation; Protein Electrophoresis; The Southern Blot Procedure; The Polymerase Chain Reaction; Polymorphism and Heterozygosity; Allozyme Polymorphisms; How Representative Are Allozymes?; Polymorphisms in DNA Sequences; Nucleotide Polymorphism and Nucleotide Diversity; Uses of Genetic Polymorphisms; Multiple-Factor Inheritance; Summary; Problems

3. ORGANIZATION OF GENETIC VARIATION: Random Mating; Nonoverlapping Generations; The Hardy-Weinberg Principle; Random Mating of Genotypes versus Random Union of Gametes; Implications of the Hardy-Weinberg Principle; The Hardy-Weinberg Principle in Operation; Complications of Dominance; Frequency of Heterozygotes; Special Cases of Random Mating; Three or More Alleles; X-Linked Genes; Linkage and Linkage Disequilibrium; Summary; Problems

4. POPULATION SUBSTRUCTURE: Hierarchical Population Structure; Reduction in Heterozygosity; Average Heterozygosity; Wright's F Statistics; Genetic Divergence among Subpopulations; Isolate Breaking: The Wahlund Principle; Wahlund's Principle and the Fixation Index; Genotype Frequencies in Subdivided Populations; Population Genetics in DNA Typing; Polymorphisms Based on a Variable Number of Tandem Repeats (VNTR); Match Probabilities with Hardy-Weinberg Equilibrium and Linkage Equilibrium; Effects of Population Substructure; Inbreeding; Genotype Frequencies with Inbreeding; Relation between the Inbreeding Coefficient and the F Statistics; The Inbreeding Coefficient as a Probability; Genetic Effects of Inbreeding; Calculation of the Inbreeding Coefficient from Pedigrees; Regular Systems of Mating; Assortative Mating; Summary; Problems

5. SOURCES OF VARIATION: Mutation; Irreversible Mutation; Reversible Mutation; Probability of Fixation of a New Neutral Mutation; The Infinite-Alleles Model; Neutral Mutations; Linkage and Recombination; Presumed Evolutionary Benefit of Recombination; Recombination and Polymorphism; Piecewise Recombination in Bacteria; Absence of Recombination in Animal Mitochondrial DNA; Migration; One-Way Migration; The Island Model of Migration; How Migration Limits Genetic Divergence; Estimates of Migration Rates; Patterns of Migration; Transposable Elements; Factors Controlling the Population Dynamics of Transposable Elements; Insertion Sequences and Composite Transposons in Bacteria; Transposable Elements in Eukaryotes; Horizontal Transmission of Transposable Elements; Summary; Problems

6. DARWINIAN SELECTION: Selection in Haploid Organisms; Discrete Generations; Continuous Time; Change in Allele Frequency in Haploids; Darwinian Fitness and Malthusian Fitness; Selection in Diploid Organisms; Change in Allele Frequency in Diploids; Time Required for a Given Change in Allele Frequency; Application to the Evolution of Insecticide Resistance; Equilibria with Selection; Overdominance; Local Stability; Heterozygote Inferiority; The Adaptive Topography and the Role of Random Genetic Drift; Mutation-Selection Balance; Equilibrium Allele Frequencies; The Haldane-Muller Principle; More Complex Types of Selection; Frequency-Dependent Selection; Density-Dependent Selection; Fecundity Selection; Age-Structured Populations; Heterogeneous Environments and Clines; Diversifying Selection; Differential Selection in the Sexes; X-Linked Genes; Gametic Selection; Meiotic Drive; Multiple Alleles; Multiple Loci and Gene Interaction: Epistasis; Sexual Selection; Kin Selection; Interdeme Selection and the Shifting Balance Theory; Summary; Problems

7. RANDOM GENETIC DRIFT: Random Genetic Drift and Binomial Sampling; The Wright-Fisher Model of Random Genetic Drift; The Diffusion Approximation; Absorption Time and Time to Fixation; Parallelism between Random Drift and Inbreeding; Effective Population Size; Fluctuation in Population Size; Unequal Sex Ratio, Sex Chromosomes, Organelle Genes; Balance between Mutation and Drift; Infinite Alleles Model; The Ewens Sampling Formula; The Ewens-Watterson Test; Infinite-Sites Model; Gene Trees and the Coalescent; Coalescent Models with Mutation; Summary; Problems

8. MOLECULAR POPULATION GENETICS: The Neutral Theory and Molecular Evolution; Theoretical Principles of the Neutral Theory; Estimating Rates of Molecular Sequence Divergence; Rates of Amino Acid Replacement; Rates of Nucleotide Substitution; Other Measures of Molecular Divergence; The Molecular Clock; Variation across Genes in the Rate of the Molecular Clock; Variation across Lineages in Clock Rate; The Generation-Time Effect; Does the Constancy of Substitution Rates Prove the Neutral Theory?; Patterns of Nucleotide and Amino Acid Substitution; Calculating Synonymous and Nonsynonymous Substitution Rates; Within-Species Polymorphism; Implications of Codon Bias; Polymorphism and Divergence in Nucleotide Sequence Data; Impact of Local Recombination Rates; Gene Genealogies; Hypothesis Testing Using Trees; Inferences about Migration Based on Gene Trees; Mitochondrial and Chloroplast DNA Evolution; Chloroplast DNA and Organelle Transmission in Plants; Maintenance of Variation in Organelle Genomes; Evidence for Selection in mtDNA; Molecular Phylogenetics; Algorithms for Phylogenetic Tree Reconstruction; Distance Methods versus Parsimony; Bootstrapping and Statistical Confidence in a Tree; Shared Polymorphism; Interspecific Genetics; Multigene Families; Causes of Concerted Evolution; Multigene Family Evolution through a Birth and Death Process; Structural RNA Genes and Compensatory Substitutions; Multigene Superfamilies; Dispersed Highly Repetitive DNA Sequences; Summary; Problems

9. QUANTITATIVE GENETICS: Types of Quantitative Traits; Resemblance between Relatives and the Concept of Heritability; Artificial Selection and Realized Heritability; Prediction Equation for Individual Selection; Selection Limits; Genetic Models for Quantitative Traits; Change in Gene Frequency; Genetic Model for the Change in Mean Phenotype; Components of Phenotypic Variance; Genetic and Environmental Sources of Variation; Components of Genotypic Variation; Covariance between Relatives; Twin Studies and Inferences of Heritability in Humans; Experimental Assessment of Genetic Variance Components; Indirect Estimation of the Number of Genes Affecting a Quantitative Character; Norm of Reaction and Phenotypic Plasticity; Threshold Traits and the Genetics of Liability; Correlated Response and Genetic Correlation; Inference of Selection from Phenotypic Data; Evolution of Quantitative Traits; Random Genetic Drift and Phenotypic Evolution; Mutation-Selection Balance; Quantitative Trait Loci; Mapping Genes that Influence Quantitative Characters; Significance Testing of QTLs; Composite Interval Mapping and Other Refinements; What Have We Learned from Mapping QTLs?; Summary; Problems

Suggestions for Further Reading

Answers to Chapter-End Problems

Bibliography

Author Index

Subject Index

## Preface

Thanks in part to the power of molecular methods, population genetics has been reinvigorated. As some genome projects are approaching closure and methods of "functional genomics" are scaling up to identify the roles of novel genes, inevitably increasing attention is being paid to the significance of genetic variation in populations. Nowhere is this more evident than in medical genetics. Within a decade we can expect that all major single-gene inherited disorders will be identified, genetically mapped, cloned, and characterized at a fine molecular level.
Health professionals realize that this impressive feat will have an impact only on a small minority of individuals. Most of the genetic variation in disease risk is multifactorial, which means that the risk is determined by multiple genetic and environmental factors acting together. Killer diseases such as familial forms of cancer, diabetes, and cardiovascular disease fall into this category. The fact that these diseases aggregate in families implies that there is probably a genetic component, but the genetic component may differ from one family or ethnic group to another. Prompted by the high incidence of multifactorial diseases as a group, the medical community has become acutely aware of the need to understand the basic structure of genetic variation in populations in order to determine what aspects of the variation cause disease.

The exciting practical applications of population genetics to the analysis of multifactorial diseases have received great attention, but the scope of population genetics actually is much broader. Population genetics provides the genetic underpinning for all of evolutionary biology. By "evolution" we mean descent with modification. Species undergo progressive genetic modification as they adapt to their environments, and new species arise as a by-product of this process. The intellectual excitement of biological evolution arises from the fact that it addresses the fundamental questions, "What are we?" and "Where did we come from?"

Patterns of evolutionary history are recorded in DNA sequences, and the application of population genetics to interpreting DNA sequences is revealing many secrets about the evolutionary past, including the history of our own species. But population genetics embraces much more than the analysis of evolutionary relationships.
It is particularly concerned with the processes and mechanisms by which evolutionary changes are made. The field is inherently multidisciplinary, cutting across molecular biology, genetics, ecology, evolutionary biology, systematics, natural history, plant breeding, animal breeding, conservation and wildlife management, human genetics, sociology, anthropology, mathematics, and statistics.

Students taking population genetics are usually expected to have completed, or to be taking concurrently, a course in differential calculus. While this book assumes a familiarity with the elementary notation for differentials and integrals, it does not require great mathematical proficiency. We have kept the mathematics to a minimum. On the other hand, some of the most important models in population genetics require quite advanced mathematics. Rather than ignore these approaches, we have made a concerted effort to present these models in such a way that the assumptions can be understood and the main results appreciated without much mathematics. References are provided for the interested reader to learn more about the details.

Several important changes distinguish the third edition of Principles from the second edition. The level of the treatment is more tailored to the needs of a one-semester or one-quarter course, with the intended audience being third- and fourth-year undergraduates as well as beginning graduate students. Population genetics is not only an experimental science but also a theoretical one. Special care has been taken to explain the biological motivation behind the theoretical models so that the models do not simply materialize out of thin air, and to explain in plain English the implications of the results. Many concepts are illustrated by numerical examples, using actual data wherever possible. Special topics and examples are often set off from the text as boxed problems whose solutions are explained step by step.
Every chapter ends with about 20 problems, graded in difficulty, and solutions worked in full appear at the end of the text.

This edition of Principles is organized into nine chapters that gradually build concepts from measuring variation and the various forces that influence genetic variation through a sequential progression to concepts of molecular population genetics and quantitative genetics. The first chapter provides a background in basic genetic and statistical principles. We discuss the fundamental concepts of allelism, dominance, segregation, recombination, and population frequencies. The role of model building and testing in population genetics is emphasized. Chapter 2 introduces the student to the primary data of population genetics, namely, the many levels of genetic variation. Chapter 3 is concerned with the organization of genetic variation into genotypes in populations. Here the Hardy-Weinberg principle gets very thorough coverage, including the cases of X-linkage and multiple alleles. Chapter 4 widens the perspective and considers the organization of genetic variation among spatially structured populations. Population substructure is measured by Wright's F statistics, which are presented in a way that conveys their biological meaning. The Wahlund principle and inbreeding are also covered in Chapter 4.

The goal of population genetics is to understand the forces that have an impact on levels of genetic variation. The forces of mutation, recombination, and migration are outlined in Chapter 5. Darwinian selection is the topic of Chapter 6, including both the theoretical foundations and empirical observations of the dynamics of gene-frequency change under the action of selection. Haploid and diploid cases are developed, as are the concepts of equilibrium, stability, and context dependence. After classical models of mutation-selection balance are developed, a series of more complex scenarios of natural selection are presented.
Chapter 7 deals with random genetic drift. In the absence of other forces, allele and genotype frequencies change as a result of random sampling from one generation to another. The Wright-Fisher model and diffusion approximations are presented in such a way that the student gains an appreciation for the importance of random genetic drift. The process of the coalescence of genealogies is an important innovation in theoretical population genetics, and some of the basic concepts of coalescence are presented in Chapter 7.

In Chapter 8 we cover the rapidly expanding data on molecular evolutionary genetics. The unifying theme in the study of molecular evolution is Kimura's neutral theory, and a close examination is made of the correspondence between the data and theory. This is a field in which advances in our empirical database and statistical tools for quantifying and manipulating the data are growing at a dizzying pace. Our goal is to give the student a firm grasp of the fundamentals, and a deep enough understanding of the principles to identify important gaps in our knowledge. One intriguing aspect of molecular evolutionary genetics is the discovery of new phenomena and forces taking place at the molecular level that go beyond the realm of classical population genetics. Multigene families and organelle genomes are described in some detail to illustrate these uniquely molecular phenomena.

Chapter 9 covers the problem of quantitative genetics from an evolutionary perspective. A compelling argument for using quantitative genetics for the study of evolution is that adaptive evolution takes place at the level of the phenotype, and quantitative genetics provides the tools for understanding transmission of phenotypic traits. Theoretical quantitative genetics is given special importance by the paradoxes it raises in contrasting evolution at the levels of the phenotype and of the DNA sequence.
Our understanding of the correspondence between phenotypic and molecular differentiation is very incomplete, and our understanding of the correspondence between the rates of morphological and molecular evolution is even less well developed. As in the preceding chapters, we hope that the student is left with a feeling that there is plenty of room for imaginative work in this area. Population genetics is a field with a bright and expanding future.

### Acknowledgments

This book was greatly improved by the efforts of many people. The staff at Sinauer Associates did a splendid job assisting us with the revision. Nan Sinauer kept us on track, collecting and assembling dozens of computer files, revisions, FAXes, phone and email messages. Chris Small oversaw the page layout and managed the art program. Andy Sinauer played an essential role in having the book reviewed and in giving helpful advice as to level and length. We are grateful to Chip Aquadro, James Jacobson, Trudy Mackay, Roger Milkman, Tim Prout, Glenys Thomson, and Ken Weiss for their comments on the previous edition. Their insights greatly improved the presentation in this one.

Neither author could participate in writing a book such as this without a supportive, patient, sympathetic laboratory staff, able and willing to keep things running smoothly while the boss is at his word-processor doing a neuronal fusion with a silicon chip. We are grateful to all of them. In Dan Hartl's laboratory, the list includes Lara Forde, Elena Lozovskaya, Dmitry Nurminsky, E.
Fidelma Boyd, Allan Lohe, Javare Nagaraju, David Sullivan, Charles Hill, Dmitri Petrov, Mark Siegal, Daniel De Aguiar, Carlos Bustamante, Jeffrey Townsend, Jorges Vieira, Christina Vieira, Isabel Beerman, Yunsun Nam, Elizabeth Stover, and Susan Yuknis. In Andy Clark's laboratory the acknowledgments include Michael Abraham, Joe Canale, Manolis Dermitzakis, Chris Fucito, Cristina Gonzalez, Jen Ionian, Angela Lambert, Brian Lazzaro, J. P. Masly, Hamish Spencer, Sarah Tishkoff, Bridget Todd, Carrie Tupper, and Lei Wang.

## Chapter 1: Genetic and Statistical Background

Genes • Gene Expression • Probability • Allele Frequency • Estimates • Standard Error • Models • Population Growth

The science of population genetics deals with Mendel's laws and other genetic principles as they affect entire populations of organisms. The organisms may be human beings, animals, plants, or microbes. The populations may be natural, agricultural, or experimental. The environment may be city, farm, field, or forest. The habitat may be soil, water, or air. Because of its wide-ranging purview, population genetics cuts across many fields of modern biology. A working knowledge has become essential in genetics, evolutionary biology, systematics, plant breeding, animal breeding, ecology, natural history, forestry, horticulture, conservation, and wildlife management. A basic understanding of population genetics is also useful in medicine, law, biotechnology, molecular biology, cell biology, sociology, and anthropology.

Population genetics also includes the study of the various forces that result in evolutionary changes in species through time. By defining the framework within which evolution takes place, the principles of population genetics are basic to a broad evolutionary perspective on biology. From an experimental point of view, evolution provides a wealth of testable hypotheses for all other branches of biology.
Many oddities in biology become comprehensible in the light of evolution: they result from shared ancestry among organisms, and they attest to the unity of life on earth.

Practical applications of population genetics are extensive. Many applications, particularly those relevant to human beings, also have important implications in ethics and social policy. Among the applications of population genetics in medicine, agriculture, conservation, and research are:

• Genetic counseling of parents and other relatives of patients with hereditary diseases.
• Genetic mapping and identification of genes for disease susceptibility in human beings, including breast cancer, colon cancer, diabetes, schizophrenia, and so forth.
• Implications of population screening for carriers of disease genes, confidentiality of results, and maintenance of health insurability.
• Studies of the heritability of IQ score and its implications for affirmative action, welfare, and other social programs.
• Statistical interpretation of the significance of matching DNA types found between a suspect and a blood or semen sample from the scene of a crime.
• Design of studies to sample and preserve a record of genetic variation among human populations throughout the world.
• Improvement in the performance of domesticated animals and crop plants.
• Organization of mating programs for the preservation of endangered species in zoos and wildlife refuges.
• Sampling and preservation of germ plasms of potentially beneficial plants and animals that may soon vanish from the wild.
• Interpretation of differences in the nucleotide sequences of genes or amino acid sequences of proteins among members of the same or closely related species.

The genetic and statistical principles underlying population genetics are for the most part simple and straightforward, but it may be helpful to preface the discussion with a few key definitions and concepts.

### Gene Expression and Gene Interaction

Gene is a general term meaning, loosely, the physical entity transmitted from parent to offspring in reproduction that influences hereditary traits. Genes influence human traits such as hair color, eye color, skin color, height, weight, and various aspects of behavior, although most of these traits are also influenced more or less strongly by environment. Genes also determine the makeup of proteins such as hemoglobin, which carries oxygen in the red blood cells, or insulin, which is important in maintaining glucose balance in the blood. Genes can exist in different forms or states. For example, a gene for hemoglobin may exist in a normal form or in any one of a number of alternative forms that result in hemoglobin molecules that are more or less abnormal. These alternative forms of a gene are called alleles.

From a biochemical point of view, a gene corresponds to a region along a molecule of DNA (deoxyribonucleic acid). DNA is the genetic material. A molecule of DNA consists of two strands wound around each other in the form of a right-handed helix (the celebrated "double helix"). Each strand is a polymer of constituents called nucleotides, of which there are four, conventionally symbolized A, T, G, and C according to the nitrogen-rich base that each contains: either adenine (A), thymine (T), guanine (G), or cytosine (C).
The paired strands are held together by weak chemical bonds (hydrogen bonds) that form between A and T at corresponding positions in opposite strands or between G and C at corresponding positions in opposite strands (Figure 1.1). Wherever one strand contains an A, the other across the way contains a T; and wherever one strand contains a G, the other across the way contains a C. Because of the pairing of complementary bases (A with T and G with C), a double-stranded DNA molecule contains an equal number of A and T nucleotides as well as an equal number of G and C nucleotides.

DNA molecules can be very long. The DNA molecule in the bacterium E. coli is about 4.7 million base pairs, that in the largest chromosome in the fruit fly Drosophila melanogaster is about 65 million base pairs, and that in the largest human chromosome is about 230 million base pairs. Physical manipulation of such large molecules is impractical. In order to be studied, they must first be broken into smaller pieces.

Figure 1.1 Genes are fundamental units of genetic information that correspond chemically to the sequence of nucleotides in a segment of DNA. A molecule of duplex DNA is composed of two intertwined strands, each of which consists of a long sequence of nucleotides. The strands are held together by pairing between the bases A and T in opposite strands and between the bases G and C in opposite strands. The short diagonal lines indicate the paired bases. There are 10 base pairs per turn of the double helix. A typical gene consists of hundreds or thousands of nucleotides, only a few of which are shown here.

#### Gene Expression

Most genes code for the polypeptide chains that constitute proteins. The code is the sequence of nucleotides along the DNA. In the decoding of the nucleotide sequence in DNA and also in the synthesis of proteins, several types of RNA (ribonucleic acid) are essential. RNA is also a polymer of nucleotides, each of which carries a base. Three of the bases in RNA (A, C, and G) are the same as those in DNA. The fourth, uracil (U), is different. When an RNA strand pairs with a complementary strand of DNA, U in the RNA pairs with A in the DNA. Hence, the base-pairing role of U in RNA is the same as that of T in DNA.

Figure 1.2 Processes in gene expression in eukaryotic cells. (A) Transcription: DNA regions coding for the amino acids in a single polypeptide can be interrupted by noncoding regions (introns). (B) RNA processing: when the DNA is copied into RNA in transcription, both coding and noncoding regions are transcribed. However, the introns are removed from the transcript by processing. (C) Translation: in the messenger RNA, the coding regions are contiguous. The messenger RNA is translated to form the chain of linked amino acids constituting the polypeptide.

The essentials of gene expression in the cells of higher organisms (eukaryotes) are outlined in Figure 1.2. The coding regions of the DNA in a gene, which code for amino acids, are often interrupted by one or more noncoding regions known as intervening sequences or introns. In the first step in gene expression (transcription), a molecule of RNA is produced that is complementary in base sequence to one of the strands of DNA (Figure 1.2A). Every gene includes a regulatory region (sometimes more than one) that determines when transcription takes place, the types of cells in which it takes place, and the strand that is to be transcribed.
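The pairing rules for duplex DNA described above (A with T, G with C) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the function names and the example sequence are invented for the purpose. It builds the partner strand of a short sequence and confirms that the duplex contains equal numbers of A and T and of G and C.

```python
# Pairing rules from the text: A pairs with T, and G pairs with C.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def partner_strand(strand):
    """Return the partner strand in the same left-to-right order.

    If `strand` is written 5'-to-3', the result reads 3'-to-5',
    because paired strands have opposite polarity.
    """
    return "".join(DNA_PAIR[base] for base in strand)


def duplex_base_counts(strand):
    """Count each base across both strands of the duplex."""
    both = strand + partner_strand(strand)
    return {base: both.count(base) for base in "ATGC"}


top = "ATGGCATTC"  # hypothetical sequence, chosen only for illustration
bottom = partner_strand(top)
counts = duplex_base_counts(top)

print(top)
print(bottom)
print(counts)
```

Whatever sequence is supplied, every A on one strand contributes a T on the other and every G contributes a C, which is why the two equalities hold for any duplex.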
Because of the base pairing rules, a DNA sequence (say, 3'-ATCG-5') results in a complementary RNA sequence (in this example, 5'-UAGC-3'). Note that the DNA and RNA strands each have a polarity or directionality. The terms 5' and 3' refer to the polarity of the strands. The 5' end typically terminates with a free phosphate group and the 3' end typically terminates with a free hydroxyl group (-OH). When two strands of nucleic acid are paired, the polarity of each strand is opposite to that of the other. In the duplex DNA in Figure 1.2, for example, the left-to-right polarity of one strand is 5'-to-3', whereas the left-to-right polarity of the partner strand is 3'-to-5'. Similarly, in transcription, the template DNA strand has a left-to-right polarity of 3'-to-5', whereas the RNA transcript has the left-to-right polarity of 5'-to-3'. Because of the complementary base pairing between DNA and RNA nucleotides, the base-sequence code in DNA becomes converted into a base-sequence code in RNA. In transcription, the base sequence present in the introns is also faithfully copied into the base sequence of the RNA transcript.

The second step in gene expression in eukaryotes is RNA processing (Figure 1.2B). The beginning and end of the RNA transcript are chemically modified and the introns are removed by splicing (cutting and rejoining). RNA processing results in a molecule called messenger RNA (mRNA), in which the coding regions have been made contiguous. The regions in the original RNA transcript that are retained in the mature mRNA are called exons. The central part of the mRNA contains the spliced exons that code for the amino acid sequence of a polypeptide chain. The mRNA also includes exons upstream and downstream from the protein-coding region.
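Transcription and splicing as described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions. The helper names `transcribe` and `splice` are hypothetical, and the intron coordinates are supplied by hand, whereas a real cell recognizes introns through sequence signals that are not modeled here. The first call reproduces the text's example, in which the template 3'-ATCG-5' yields the transcript 5'-UAGC-3'.

```python
# Template-DNA base -> RNA base. RNA carries U where DNA would carry T.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5):
    """Transcribe a template strand written 3'-to-5', left to right.

    Because paired strands have opposite polarity, the returned RNA
    reads 5'-to-3' in the same left-to-right order; no reversal is needed.
    """
    return "".join(DNA_TO_RNA[base] for base in template_3to5)


def splice(transcript, introns):
    """Cut out introns and rejoin the flanking exons.

    `introns` is a list of (start, end) index pairs, end exclusive.
    Supplying coordinates by hand is an assumption of this sketch.
    """
    exons, position = [], 0
    for start, end in sorted(introns):
        exons.append(transcript[position:start])
        position = end
    exons.append(transcript[position:])
    return "".join(exons)


rna = transcribe("ATCG")  # the text's example template, 3'-ATCG-5'

# Hypothetical primary transcript: exon + intron + exon (sequences invented).
primary = "AUGGCC" + "GUAAGUUUCAG" + "GAAUAA"
mrna = splice(primary, [(6, 17)])

print(rna)
print(mrna)
```

Writing the template 3'-to-5' mirrors the convention used in the text, so the transcript comes out already oriented 5'-to-3'.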
The upstream region is the 5' untranslated region and the downstream region is the 3' untranslated region.

The final step in gene expression is translation, in which the mRNA molecule combines with ribosomes and other types of RNA molecules in the cytoplasm to produce the final polypeptide (Figure 1.2C). In the coding region of the mRNA, each adjacent group of three nucleotides constitutes a separate coding group or codon that specifies which amino acid is to be incorporated into the polypeptide chain. The ribosome moves along the mRNA in steps of three nucleotides (codon by codon). As each new codon comes into place, the correct amino acid is brought into line and attached to the end of the growing chain of amino acids. New amino acids are added to the growing chain until a codon specifying "stop" is encountered. At this point synthesis of the chain of amino acids is finished and the polypeptide is released from the ribosome.

In prokaryotes, which include bacteria and other organisms lacking a nucleus, gene expression is essentially identical to that in eukaryotes except for the absence of RNA processing. Genes in prokaryotes do not contain introns and so splicing is unnecessary. In prokaryotes, the original RNA transcript is used immediately as mRNA and translated into a polypeptide. Because there is no separate nucleus, translation in prokaryotes often begins immediately when the 5' end of an RNA transcript comes off the DNA and even before transcription of the 3' end of the same molecule has been completed.

The central role of RNA in gene expression is one of the oddities of biology that makes sense in the light of evolution. That gene expression is configured around RNA is a legacy of the earliest forms of life when RNA molecules served both as carriers of genetic information and as catalytic molecules.
The role of RNA as carrier of genetic information was gradually replaced by DNA, and the role of RNA as catalytic molecules was gradually replaced by proteins. At every step along the way, as the RNA world evolved into the DNA world, the role of RNA was indispensable in the processes of information transfer and protein synthesis, and so the RNA intermediates became locked in place.

The Genetic Code

The genetic code is the list of all codons showing which amino acid each codon specifies. Table 1.1 shows the standard genetic code used in nuclear genes in most organisms. A few organisms and some cellular organelles, such as mitochondria, use slightly altered codes. The codons in Table 1.1 are those found in the mRNA. The amino acids are given by three-letter abbreviations as well as by conventional single-letter abbreviations. Codon AUG is the start codon in polypeptide synthesis; it specifies methionine (Met) at the beginning of the polypeptide as well as at internal positions. Three codons are stops that result in termination of polypeptide synthesis: UAA, UAG, and UGA.

The genetic code is redundant in that most amino acids are specified by more than one codon. Most of the redundancy is in the third codon position. A code for an amino acid is twofold degenerate if either of two sequences specifies the same amino acid. Twofold degenerate codes have the pattern ··Y or ··R, where ·· stands for the bases in codon positions 1 and 2. The symbol Y stands for any pyrimidine base (either U or C); the symbol R stands for any purine base (either A or G). For example, CAU and CAC both code for histidine (His), fitting the pattern CAY; and CAA and CAG both code for glutamine (Gln), fitting the pattern CAR. A code for an amino acid is fourfold degenerate if any of four sequences specifies the same amino acid. Fourfold degenerate codes have the form ··N, where N means any nucleotide (U, C, A, or G).
For example, GUU, GUC, GUA, and GUG all code for valine (Val), which fits the pattern GUN. Note in Table 1.1 that the code for isoleucine is threefold degenerate and those for leucine, arginine, and serine are each sixfold degenerate.

TABLE 1.1 THE STANDARD GENETIC CODE

                     Second nucleotide in codon
First   U              C              A              G
U       UUU Phe (F)    UCU Ser (S)    UAU Tyr (Y)    UGU Cys (C)
        UUC Phe (F)    UCC Ser (S)    UAC Tyr (Y)    UGC Cys (C)
        UUA Leu (L)    UCA Ser (S)    UAA Stop       UGA Stop
        UUG Leu (L)    UCG Ser (S)    UAG Stop       UGG Trp (W)
C       CUU Leu (L)    CCU Pro (P)    CAU His (H)    CGU Arg (R)
        CUC Leu (L)    CCC Pro (P)    CAC His (H)    CGC Arg (R)
        CUA Leu (L)    CCA Pro (P)    CAA Gln (Q)    CGA Arg (R)
        CUG Leu (L)    CCG Pro (P)    CAG Gln (Q)    CGG Arg (R)
A       AUU Ile (I)    ACU Thr (T)    AAU Asn (N)    AGU Ser (S)
        AUC Ile (I)    ACC Thr (T)    AAC Asn (N)    AGC Ser (S)
        AUA Ile (I)    ACA Thr (T)    AAA Lys (K)    AGA Arg (R)
        AUG Met (M)    ACG Thr (T)    AAG Lys (K)    AGG Arg (R)
G       GUU Val (V)    GCU Ala (A)    GAU Asp (D)    GGU Gly (G)
        GUC Val (V)    GCC Ala (A)    GAC Asp (D)    GGC Gly (G)
        GUA Val (V)    GCA Ala (A)    GAA Glu (E)    GGA Gly (G)
        GUG Val (V)    GCG Ala (A)    GAG Glu (E)    GGG Gly (G)

Note: Codons are nonoverlapping three-base sequences present in mRNA, each of which specifies an amino acid in a polypeptide chain or terminates synthesis ("Stop"). The full names of the amino acids are phenylalanine (Phe), leucine (Leu), isoleucine (Ile), methionine (Met), valine (Val), serine (Ser), proline (Pro), threonine (Thr), alanine (Ala), tyrosine (Tyr), histidine (His), glutamine (Gln), asparagine (Asn), lysine (Lys), aspartic acid (Asp), glutamic acid (Glu), cysteine (Cys), tryptophan (Trp), arginine (Arg), and glycine (Gly).

The codons for amino acids are not used randomly in proteins. There are preferred codons for amino acids that differ from one gene to the next and from one organism to another. Codon preferences exist even within redundancy classes. In Drosophila, for example, among codons for histidine, CAC is used more than CAU in a ratio of about 2 : 1.
Similarly, among codons for glutamine, CAG is used more than CAA in a ratio of about 3 : 1. Another example of nonrandom codon usage is the AUA codon for isoleucine, which tends to be avoided in most proteins in most organisms. In Drosophila, AUU and AUC are used more than AUA in a ratio of about 10 : 1. One evolutionary hypothesis that explains the avoidance of AUA is that, because of the degeneracy of the genetic code, the AUA codon might sometimes be translated as AUG, which codes for methionine. Because methionine is likely to change protein structure radically, the mistranslation would be a costly mistake. Through evolutionary time, one by one, the AUA codons in a messenger RNA become replaced with AUU or AUC, minimizing this type of misincorporation error. This misincorporation hypothesis for AUA codon avoidance has not been tested, but it is testable.

Alleles

Alternative alleles of a gene differ in their sequence of nucleotides (Figure 1.3). For example, where one allele has a T-A base pair in the DNA, another may have a C-G base pair at the same position. Because of redundancy in the code, not all nucleotide substitutions result in a replacement of one amino acid for another. In Figure 1.3B, for example, if a mutation at the third position in the second codon (asterisk) changes one pyrimidine into the other, the new codon still codes for histidine. On the other hand, some nucleotide substitutions at the third position do result in amino acid replacements. For example, in Figure 1.3C, if the third position in the second codon changes from a pyrimidine to a purine, the codon changes from one for histidine to one for glutamine. Most nucleotide substitutions at codon positions one and two result in amino acid replacements (Figure 1.3D).

Not all alleles differ by a mere nucleotide substitution. Relative to the typical or wildtype allele, some alleles may have a deletion of a number of nucleotide pairs or an insertion into the DNA molecule.
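The degeneracy classes read from Table 1.1, and the distinction between substitutions that do and do not replace an amino acid, can be checked mechanically. The following Python fragment is only an illustrative sketch: the names `GENETIC_CODE`, `degeneracy`, and `is_synonymous` are hypothetical helpers introduced here, and the table is the standard code written with single-letter amino acid abbreviations, with "*" standing for a stop codon.

```python
from collections import Counter

BASES = "UCAG"
# Single-letter amino acids in the codon order UUU, UUC, UUA, UUG, UCU, ...
# ("*" marks a stop codon). This layout is an assumption of the sketch.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
GENETIC_CODE = dict(zip(CODONS, AMINO_ACIDS))


def degeneracy():
    """Count how many codons specify each amino acid (or stop)."""
    return Counter(GENETIC_CODE.values())


def is_synonymous(codon_a, codon_b):
    """True if both codons specify the same amino acid."""
    return GENETIC_CODE[codon_a] == GENETIC_CODE[codon_b]


counts = degeneracy()
# Isoleucine, leucine, arginine, serine
print(counts["I"], counts["L"], counts["R"], counts["S"])
# Third-position change between pyrimidines: still histidine
print(is_synonymous("CAU", "CAC"))
# Third-position change from pyrimidine to purine: histidine to glutamine
print(is_synonymous("CAU", "CAA"))
```

Building the table from a 64-character string keeps the sketch short; a dictionary typed out codon by codon would serve equally well.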
The number of nucleotides deleted or inserted may be small (as few as one nucleotide pair) or large. Some insertions are thousands of nucleotide pairs in size. Many large insertions result from the activity of transposable elements, which are specialized sequences of DNA able to replicate and insert at novel positions virtually anywhere in the DNA of the organism in which they are present. Alleles also may differ in the number of copies of short sequences present in tandem arrays in the DNA. For example, near many genes in human beings are tandem copies of dinucleotides, such as 5'-CACACACA . . . -3'. Such a repeating sequence is symbolized as (5'-CA-3')n. The number of copies (n) of the dinucleotide repeat often ranges from fewer than ten to hundreds, and the number of copies may differ dramatically from one allele to the next. Some alleles even differ from wildtype in having an inversion of the nucleotide sequence in a region of DNA.

Figure 1.3 Alleles are alternative forms of a gene. (A) The arrows show how the genetic information in a portion of the nucleotide sequence of DNA specifies the amino acid sequence in a portion of a polypeptide. Each group of three adjacent nucleotides corresponds to one amino acid in the polypeptide. (B, C, D) Substitution of one nucleotide for another in the DNA (indicated by the asterisks and heavy lines) can result in the replacement of one amino acid for another in the polypeptide.

Genotype and Phenotype

Within a living cell, genes are arranged in linear order along microscopic threadlike bodies called chromosomes. A typical chromosome may contain several thousand genes. The position of a gene along a chromosome is called the locus of the gene. In most higher organisms, each cell contains two copies of each type of chromosome.
Such organisms, in which the chromosomes are present in pairs, are said to be diploid. In each pair of chromosomes, one member is inherited from the mother through the egg and the other is inherited from the father through the sperm. At every locus, therefore, diploid organisms contain two alleles, one each at corresponding positions in the maternal and paternal chromosomes. If the two alleles at a locus are chemically identical (in the sense of having the same nucleotide sequence along the DNA), the organism is said to be homozygous at the locus under consideration; if the two alleles at a locus are chemically different, the organism is said to be heterozygous at the locus. The term gene is a general term usually used in the sense of locus.

Geneticists make a fundamental distinction between the genetic constitution of an organism and the physical or biochemical attributes of the organism. The genetic constitution of an organism is called the genotype; genotype thus refers to the particular alleles present in an organism at all loci that affect the trait in question. For example, if a trait is influenced by two genes, each with two alleles, then there are nine possible genotypes, as follows:

AA;BB   AA;Bb   AA;bb
Aa;BB   Aa;Bb   Aa;bb
aa;BB   aa;Bb   aa;bb

where A and a refer to the alleles of the first gene and B and b refer to the alleles of the second gene. When the genes are linked (located in the same chromosome), it is sometimes necessary to distinguish between the genotypes AB/ab and Ab/aB, in which case there are ten possible genotypes. In contrast to genotype, the physical expression of a genotype is called the phenotype.
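The counts of nine and ten genotypes given above can be confirmed by direct enumeration. The sketch below is an illustration only; the function name `single_locus_genotypes` and the string formats are choices made for this example rather than notation from the text.

```python
from itertools import combinations_with_replacement, product


def single_locus_genotypes(alleles):
    """Unordered pairs of alleles: all homozygotes and heterozygotes."""
    return ["".join(pair) for pair in combinations_with_replacement(alleles, 2)]


# Two genes with two alleles each, ignoring which chromosome carries which allele
gene_a = single_locus_genotypes(["A", "a"])  # AA, Aa, aa
gene_b = single_locus_genotypes(["B", "b"])  # BB, Bb, bb
two_gene = [f"{g1};{g2}" for g1, g2 in product(gene_a, gene_b)]
print(len(two_gene), two_gene)

# With linkage, a genotype is an unordered pair of chromosomes (gametic types),
# so the double heterozygote splits into AB/ab and Ab/aB.
chromosomes = ["AB", "Ab", "aB", "ab"]
phased = [f"{c1}/{c2}" for c1, c2 in combinations_with_replacement(chromosomes, 2)]
print(len(phased), phased)
```

The extra genotype in the linked case arises because Aa;Bb corresponds to two distinct chromosome pairings.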
Examples of phenotypes include hair color, eye color, height, weight, number of kernels on an ear of corn, number of eggs laid by a hen, and round versus wrinkled pea seeds. The distinction between the genetic constitution of an organism (genotype) and the physical or biochemical attributes of the organism (phenotype) is particularly important in cases in which the environment can affect the trait; in such cases, two organisms with the same genotype can nevertheless have different phenotypes because of differences in the environment. Conversely, two organisms with the same phenotype can have different genotypes.

PROBLEM 1.1 If a gene in a diploid organism has m alleles, show that the number of possible genotypes equals m(m + 1)/2.

ANSWER Consider first the heterozygotes. There are m ways of choosing the first allele and, having done that, there are m - 1 ways of choosing a different second allele. Altogether, there are m(m - 1)/2 different heterozygotes. The division by 2 is necessary because, for each heterozygote (say, AiAj), it makes no difference whether Ai was chosen first and Aj second or the other way around. In addition to the heterozygotes, there are m possible homozygotes. Hence, the total number of diploid genotypes equals [m(m - 1)/2] + m = m(m + 1)/2.

Dominance and Gene Interaction

Whether each genotype has a single, unique expression of the trait depends on the manner in which the alleles of a gene interact in development. For the alleles of one gene, dominance refers to the concealment of the presence of one allele by the strong phenotypic effects of another. For example, with two alleles there are three possible genotypes:

AA   Aa   aa

Several types of dominance are distinguished, as in the following examples:

• Complete dominance: A is completely dominant to a if the phenotypes of AA and Aa cannot be distinguished.
• Incomplete dominance: A shows incomplete dominance with respect to a if the phenotype of Aa is intermediate between that of AA and that of aa. This situation is also referred to as partial dominance or intermediate dominance. When the phenotype can be measured on a quantitative scale (for example, the number of kernels on an ear of corn) and the phenotype of Aa is exactly the average between that of AA and that of aa, then the alleles are said to be additive alleles and the type of dominance is sometimes called semidominance.
• Codominance: A and a are codominant if the products of both alleles can be detected in Aa heterozygotes. Many alleles are codominant at the level of their protein products because two different forms of the polypeptide, encoded by A and a, can be detected in heterozygotes. At the level of the DNA sequences, all alleles differing in DNA sequence are codominant.

It is important to note that dominance is not a characteristic of alleles so much as a characteristic of the manner in which the phenotype is examined. An allele may show complete dominance if the phenotype is examined in one way, no dominance if examined in another, and codominance if examined in still another. For example, the allele for round pea seeds W studied by Gregor Mendel is completely dominant to that for wrinkled seeds w when the phenotype "round" versus "wrinkled" is examined. The genetic defect in wrinkled seeds is the absence of an enzyme needed for the synthesis of a branched-chain form of starch. Microscopic examination reveals subtle differences in the form of the starch grains in seeds of the three genotypes: WW seeds contain large, well-rounded starch grains, retain water, and shrink uniformly as they ripen, so the seeds do not become wrinkled; ww seeds lack the branched-chain starch and are irregular in shape because the ripening seeds lose water more rapidly and shrink unevenly.
However, heterozygous Ww seeds have starch grains that are intermediate in shape even though the seeds shrink uniformly and show no wrinkling. Therefore, at the level of the starch grains, there is incomplete dominance of W and w because the starch grains in the heterozygotes are intermediate between the two homozygotes. Furthermore, the difference in DNA sequence between W and w can readily be detected with modern methods, so that W and w are codominant at the level of DNA sequence.

For traits affected by more than one gene, the relation between genotype and phenotype depends not only on the degree of dominance of the alleles of each gene but also on the type of interaction between the genes in development. For example, suppose that the trait in question is degree of pigmentation and that pigmentation is determined by two alleles of each of two genes, say, A, a and B, b. Suppose further that the total amount of pigment in an organism results from the total number of A and B alleles present, each of which adds a single unit of pigmentation to the phenotype. Then, as shown in Table 1.2, there are only five possible levels of pigmentation (0 through 4) and genotypes aa BB, Aa Bb, and AA bb all have the same phenotype. Because each uppercase allele adds the same quantity to the total phenotype, the type of gene interaction in Table 1.2 is said to be additive.

Segregation and Recombination

The essential mechanism of inheritance was established by Gregor Mendel (1822-1884) in experiments with garden peas carried out in the years 1856 to 1863 in a small garden plot next to the monastery in which he lived. Mendel showed that the alleles of each gene segregate from one another in the formation of reproductive cells or gametes. Because of segregation, heterozygous genotypes form equal numbers of gametes containing each allele.
Furthermore, because gametes unite at random in fertilization, the following are the results of simple Mendelian segregation:

• AA x AA matings produce all AA progeny.
• AA x Aa matings produce 1/2 AA and 1/2 Aa progeny.
• AA x aa matings produce all Aa progeny.
• Aa x Aa matings produce 1/4 AA, 1/2 Aa, and 1/4 aa progeny.
• Aa x aa matings produce 1/2 Aa and 1/2 aa progeny.
• aa x aa matings produce all aa progeny.

TABLE 1.2 A MODEL OF ADDITIVE GENE ACTION (a)

Genotype                 Amount of pigmentation (b)
AA BB                    4
Aa BB, AA Bb             3
aa BB, Aa Bb, AA bb      2
Aa bb, aa Bb             1
aa bb                    0

(a) At left are shown the nine possible genotypes of two genes with two alleles of each gene. At right is shown the amount of pigmentation expected in each genotype when it is assumed that each allele designated by an uppercase letter is responsible for producing a certain amount of pigment.
(b) Measured as an increase in pigmentation over that in aa bb genotypes.

The physical basis of Mendelian segregation is that the maternal and paternal pairs of chromosomes are separated into different cells in the formation of gametes. Prior to their separation, the maternal and paternal chromosomes associate intimately all along their length and alleles may be interchanged in the process of recombination (Figure 1.4). The interchange of parts takes place after the chromosomes have replicated, and only two of the four chromosome strands participate in any one exchange. Recombination results in the creation of allele combinations different from either parental chromosome. In Figure 1.4, the A b and a B combinations are recombinant, whereas the A B and a b combinations are parental (nonrecombinant). Therefore, a single exchange between parental chromosomes results in two recombinant and two nonrecombinant gametes.
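The six mating results listed earlier in this section follow from two assumptions: each parent transmits either of its two alleles with probability 1/2, and gametes unite at random. A minimal Python sketch of that calculation is given below; the helper names `gamete_freqs` and `progeny` are hypothetical, and exact fractions are used so that the output can be compared directly with the list.

```python
from fractions import Fraction
from itertools import product


def gamete_freqs(genotype):
    """Each of the two alleles in a diploid genotype is transmitted with probability 1/2."""
    freqs = {}
    for allele in genotype:
        freqs[allele] = freqs.get(allele, Fraction(0)) + Fraction(1, 2)
    return freqs


def progeny(mother, father):
    """Expected genotype proportions among offspring under random union of gametes."""
    result = {}
    pairs = product(gamete_freqs(mother).items(), gamete_freqs(father).items())
    for (egg, f_egg), (sperm, f_sperm) in pairs:
        # Sort so that 'aA' and 'Aa' are counted as the same heterozygote
        genotype = "".join(sorted(egg + sperm))
        result[genotype] = result.get(genotype, Fraction(0)) + f_egg * f_sperm
    return result


matings = [("AA", "AA"), ("AA", "Aa"), ("AA", "aa"),
           ("Aa", "Aa"), ("Aa", "aa"), ("aa", "aa")]
for mother, father in matings:
    summary = ", ".join(f"{freq} {g}" for g, freq in sorted(progeny(mother, father).items()))
    print(f"{mother} x {father}: {summary}")
```

The sketch treats a single gene only; it does not model recombination between linked genes.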
In organisms with an XX-XY chromosomal mechanism of sex determination, Mendelian segregation randomizes the sex ratio at fertilization. In mammals and many other animals, sex is determined by sex chromosomes: males have an X and a Y chromosome, and females have two X chromosomes. In males, the X and Y chromosomes segregate, yielding equal proportions of X-bearing and Y-bearing sperm. If both types of sperm are equally able to fertilize eggs, then random union of sperm with eggs yields 1/2 XX (female) and 1/2 XY (male) chromosome constitutions.

Figure 1.4 Recombination results from a physical interchange of parts between chromosomes. New combinations of alleles are created that differ from either parental chromosome. The physical interchange of parts takes place in gamete formation after the chromosomes have replicated, and only two of the four chromosome strands participate in any one exchange.

PROBABILITY IN POPULATION GENETICS

The basic concepts of probability needed for elementary population genetics are quite straightforward. They will be introduced with the concrete example of genetic segregation in Figure 1.5, which deals with the progeny of the mating Aa x Aa.

Figure 1.5 Basic concepts of probability illustrated by Mendelian segregation in the mating Aa x Aa. The elementary outcomes of the mating are the possible genotypes of each progeny (AA, Aa, and aa) and these are realized with probabilities 1/4, 1/2, and 1/4, respectively. (A) The compound event A- consists of the two elementary outcomes AA and Aa, and the probability of A- is the sum of the probabilities of these elementary outcomes (addition rule). (B) The possible distributions of genotypes A- and aa in sibships of three offspring. Successive births are independent, and so the probability of any sibship equals the product of the probabilities for each birth separately (multiplication rule).

(A) Addition rule. Mating: Aa x Aa. Offspring: 1/4 AA + 1/2 Aa + 1/4 aa. A- means "offspring either AA or Aa."

(B) Multiplication rule.

Sibship   Birth order (1, 2, 3)   Probability
1         A-  A-  A-              3/4 x 3/4 x 3/4 = 27/64
2         A-  A-  aa              3/4 x 3/4 x 1/4 = 9/64
3         A-  aa  A-              3/4 x 1/4 x 3/4 = 9/64
4         aa  A-  A-              1/4 x 3/4 x 3/4 = 9/64
5         A-  aa  aa              3/4 x 1/4 x 1/4 = 3/64
6         aa  A-  aa              1/4 x 3/4 x 1/4 = 3/64
7         aa  aa  A-              1/4 x 1/4 x 3/4 = 3/64
8         aa  aa  aa              1/4 x 1/4 x 1/4 = 1/64

Considerations in probability always begin with an experiment of some kind. The experiment may be either a real experiment or a conceptual experiment. In Figure 1.5, it is a conceptual experiment in which Aa is crossed with Aa. In probability calculations, it is also necessary to define all possible outcomes of the experiment. The outcomes are called elementary outcomes because they are defined in such a way that, in any repetition of the experiment, one and only one of the elementary outcomes must be realized. For example, if we are interested in the genotypes among the progeny of the mating Aa x Aa, the possible elementary outcomes for each offspring are either AA, Aa, or aa. (Note that, in defining these as the elementary outcomes, we are ignoring the possibility of either A or a mutating to a novel allele.) To proceed further, we must assign to each elementary outcome a probability, a number between 0 and 1 that measures how much confidence we have that the outcome will be realized. The probabilities assigned to the outcomes are based on genetic reasoning, intuition, or experience. One requirement of the assigned probabilities is that the probabilities of all the elementary outcomes must add to 1; this is the mathematical consequence of requiring that one of the elementary outcomes must be realized. For example, if there are three elementary outcomes, and all are equally probable, then each has a probability of 1/3. In Figure 1.5, the probabilities assigned to the elementary outcomes AA, Aa, and aa are 1/4, 1/2, and 1/4, respectively, because these are the relative proportions of the three progeny genotypes expected from Mendelian segregation.

The Addition Rule

An outcome of a conceptual experiment is an event. The distinction between an event and an elementary outcome is that an event can include more than one elementary outcome. For example, in Figure 1.5A, the event "the offspring has at least one copy of the dominant A allele" consists of two elementary outcomes, namely, genotypes AA and Aa. This event may be symbolized A-, where the dash indicates that the unspecified allele may be either A or a. For events defined in terms of elementary outcomes, the probability of an event equals the sum of the probabilities of the elementary outcomes included in the event. In the present example,

Pr(A-) = Pr(AA) + Pr(Aa) = 1/4 + 1/2 = 3/4

More generally, two events are mutually exclusive if they cannot be realized simultaneously. The addition rule states that, for mutually exclusive events, the probability that either one or the other is realized equals the sum of the probabilities of the separate events.

The Multiplication Rule

Figure 1.5B shows all possible genotypes of sibships of three offspring from the mating Aa x Aa, with each offspring classified as A- versus aa. (A sibship is a group of brothers and sisters.) The probability of A- in any particular birth is 3/4 and that of aa is 1/4. The probabilities at the right are the overall probabilities for each of the sibships. They are obtained by multiplication of the probability for each birth because successive births are independent, which means that the genotype of any birth has no effect on the genotype of any other birth. Because of the independence, among the 3/4 of the sibships with A- in the first birth, 3/4 will have A- in the second birth, and among the 3/4 x 3/4 of the sibships with A- in the first two births, 3/4 will have A- in the third birth.
Therefore, the overall probability of three A- births is 3/4 x 3/4 x 3/4. The reasoning for the other types of sibships is similar. More generally, the multiplication rule states that, whenever two events are independent, the probability of their joint realization is the product of the probabilities of their being realized separately.

Repeated Trials

The sibships in Figure 1.5B are an example of repeated trials of a conceptual experiment. Repeated trials are encountered frequently in probability. They govern tosses of a coin or dice, deals of cards, successive spins of a roulette wheel, and so forth. Repeated trials are also important in population genetics because successive offspring of a mating are independent events and thus repeated trials. Furthermore, it is apparent from Figure 1.5B that the different birth orders are mutually exclusive: any sibship can have one and only one birth order of A- or aa. Because the birth orders are mutually exclusive, their probabilities may be combined by the addition rule. Hence, the composite events below have the following probabilities:

Pr(two A- and one aa) = 9/64 + 9/64 + 9/64 = 27/64
Pr(one A- and two aa) = 3/64 + 3/64 + 3/64 = 9/64

Note in Figure 1.5B that, when the sibships with the same number of A- and aa genotypes are combined, the overall probabilities are given by successive terms in the expansion of:

$$\left(\tfrac{3}{4}\,A\text{-} + \tfrac{1}{4}\,aa\right)^{3} = 1 \times \left(\tfrac{3}{4}\right)^{3} + 3 \times \left(\tfrac{3}{4}\right)^{2}\left(\tfrac{1}{4}\right)^{1} + 3 \times \left(\tfrac{3}{4}\right)^{1}\left(\tfrac{1}{4}\right)^{2} + 1 \times \left(\tfrac{1}{4}\right)^{3}$$

The four terms correspond, in order, to the sibships A- A- A-; A- A- aa; A- aa aa; and aa aa aa. The coefficients 1 : 3 : 3 : 1 are the number of combinations in which each triad of genotypes can be born: 1 for A- A- A-, 3 for A- A- aa (because the aa genotype can be born either first, second, or third), and so forth.
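The way the multiplication and addition rules combine in Figure 1.5B can be reproduced by enumerating all eight birth orders. The fragment below is a sketch under the same assumptions as the text (independent births, probability 3/4 for A- and 1/4 for aa); the variable names are illustrative only.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Probability of each phenotype class in a single birth from Aa x Aa
P = {"A-": Fraction(3, 4), "aa": Fraction(1, 4)}

by_count = defaultdict(Fraction)   # number of A- offspring -> total probability
orders = defaultdict(int)          # number of A- offspring -> number of birth orders

for birth_order in product(P, repeat=3):
    # Multiplication rule: births are independent
    prob = P[birth_order[0]] * P[birth_order[1]] * P[birth_order[2]]
    k = birth_order.count("A-")
    # Addition rule: distinct birth orders are mutually exclusive
    by_count[k] += prob
    orders[k] += 1

for k in (3, 2, 1, 0):
    print(f"{k} A- and {3 - k} aa: {orders[k]} birth order(s), probability {by_count[k]}")
```

The counts of birth orders recovered this way are the coefficients 1 : 3 : 3 : 1 of the expansion.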
Each power of 3/4 and 1/4 is the probability that any one of the birth orders will be realized; for example, $(3/4)^2 (1/4)^1$ is the probability that any one particular birth order with two A- and one aa genotype will be realized.

In all cases of repeated and independent trials, the overall probabilities are given by analogous expansions. Suppose that any one trial may result in either of two mutually exclusive events, A or B, and that the probability of event A is p and that of event B is q (with p + q = 1). Among a total of n independent trials, what is the probability that A is realized exactly r times and B is realized exactly n - r times? By the multiplication rule, any particular combination of r As and n - r Bs has a probability $p^r q^{n-r}$. Deducing the total number of combinations of r As and n - r Bs is a little less obvious, but it is given by the coefficient of the term $p^r q^{n-r}$ in the expansion of $(p + q)^n$, which equals

$$\frac{n!}{r!\,(n-r)!} \tag{1.1}$$

where the exclamation point means the factorial, the product of all integers from 1 through the number in question. For example, n! = 1 x 2 x 3 x · · · x n. For consistency, the number 0! is defined as 0! = 1. Equation 1.1 is often called a binomial coefficient because it arises in the expansion of the two terms $(p + q)^n$. To understand the reason why Equation 1.1 yields the correct number of combinations of r As and (n - r) Bs, first consider what the n! means. It is the total number of ways that any set of n objects can be arranged in order. There are n ways to choose the first object and, having chosen the first, n - 1 ways to choose the second and, having chosen the first two, n - 2 ways to choose the third, and so on, yielding n x (n - 1) x (n - 2) x · · · x 1 = n!. Furthermore, for each arrangement of n objects of which r are As and (n - r) are Bs, there are r! ways to arrange the As among themselves and (n - r)! ways to arrange the Bs among themselves, for a total of r! x (n - r)! arrangements.
Because the n! orderings of r As and (n - r) Bs fall into groups of r! x (n - r)! equivalent arrangements of the As and Bs, the total number of different arrangements of r As and (n - r) Bs equals the ratio given in Equation 1.1.

Equation 1.1 gives the number of different arrangements of r As and (n - r) Bs. Each arrangement has a probability given by $p^r q^{n-r}$. Therefore, using the addition rule, the probability that n repeated trials yields r realizations of A and (n - r) realizations of B equals

$$\frac{n!}{r!\,(n-r)!}\, p^{r} q^{n-r} \tag{1.2}$$

As an example of the use of Equation 1.2, consider the probability that a sibship of 12 offspring from the mating Aa x Aa perfectly matches the expected Mendelian ratio of 9 A- and 3 aa. In this case, p = 3/4, q = 1/4, n = 12, r = 9, and n - r = 3. The required probability from Equation 1.2 is therefore

$$\frac{12!}{9!\,3!} \left(\tfrac{3}{4}\right)^{9} \left(\tfrac{1}{4}\right)^{3} = 220 \times 0.0751 \times 0.0156 = 0.258$$

The implication of this calculation is that, whereas the "expected" ratio is 9 A- : 3 aa, only a little more than 25% of such sibships actually have the expected distribution.

PROBLEM 1.2 Suppose that a society decided to limit the number of males by denying further reproduction to any woman who gives birth to a male child. Given a ratio of males to females at birth of 1 : 1, how would such a law affect the sex ratio? Suppose further that, in practice, any woman who has a female child voluntarily terminates reproduction with probability p. In this case, what is the proportion of males in sibships of size n?

ANSWER The law would have no effect on the sex ratio. To understand why, consider the first birth across the entire population. The sex ratio among these offspring must be 50% males. Consider now the second birth. The sex ratio among these offspring must also be 50% males. In fact, the sex ratio in any birth must be 50% males, and so it must be for the population of births as a whole.
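The argument that a stopping rule cannot change the overall sex ratio can be checked with a small simulation. The sketch below is illustrative only: the function `simulate_families`, its parameter names, the fixed random seed, and the cap of 50 births per family are all assumptions made for this example. Every birth is taken to be male with probability 1/2, independently of earlier births.

```python
import random


def simulate_families(n_families, p_stop_after_girl=0.0, max_births=50, seed=1):
    """Return the proportion of males among all births under the stopping rule.

    A family stops after its first son. After each daughter, the mother stops
    voluntarily with probability p_stop_after_girl. max_births is only a safety
    cap for the simulation (an assumption, not part of the problem).
    """
    rng = random.Random(seed)
    males = females = 0
    for _ in range(n_families):
        for _ in range(max_births):
            if rng.random() < 0.5:
                males += 1
                break  # a son ends reproduction under the rule
            females += 1
            if rng.random() < p_stop_after_girl:
                break  # voluntary stop after a daughter
    return males / (males + females)


print(simulate_families(100_000, p_stop_after_girl=0.0))
print(simulate_families(100_000, p_stop_after_girl=0.3))
```

Both runs should give a proportion of males close to 1/2, up to sampling error, whatever value is chosen for the voluntary stopping probability.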
In regard to the second part of the question, note that sibships of size n can be separated into two classes: those in which the final birth is a male (and so further reproduction is denied) and those in which the final birth is a girl (in which the mother voluntarily stops reproducing with probability p). These types of sibships occur in the ratio 1 : p, which results in the proportions 1/(1 + p) and p/(1 + p), respectively. The first type of sibship has a proportion of males of 1/n and the second has a proportion of males of 0. Hence, the proportion of males as a function of sibship size equals (1/n) x [1/(1 + p)] + 0 x [p/(1 + p)] = 1/[n(1 + p)]. Note that, for p = 0, the proportion of males as a function of sibship size decreases according to the series 1, 1/2, 1/3, 1/4, . . . . Nevertheless, the sex ratio in the population as a whole equals 1/2 for this and any other value of p.

PHENOTYPIC DIVERSITY AND GENETIC VARIATION

One of the universal attributes of natural populations is that organisms differ in phenotype with respect to many traits. Phenotypic diversity in many traits is impressive even with the most casual observation. Among human beings, for example, there is diversity with respect to height, weight, body conformation, hair color and texture, skin color, eye color, and many other physical and psychological attributes or skills. Population genetics must deal with this phenotypic diversity, and especially with that portion of the diversity that is caused by differences in genotype. In particular, the field of population genetics has set for itself the tasks of determining how much genetic variation exists in natural populations and of explaining its origin, maintenance, and evolutionary importance.
Genetic variation, in the form of multiple alleles of many genes, exists in most natural populations. In most sexually reproducing populations, no two organisms (barring identical twins or other multiple identical births) can be expected to have the same genotype for all genes. Thus, it becomes important to describe how alleles in natural populations are organized into genotypes, to determine, for example, whether alleles of the same or different genes are associated at random.

Allele Frequencies in Populations

Much of the phenotypic variation in natural populations does not yield simple Mendelian segregation ratios such as 1 : 1 or 3 : 1 in pedigrees. Some differences in phenotype are environmental in origin and so are not expected to show Mendelian segregation. However, simple Mendelian segregation is not usually observed even for traits whose expression is influenced more or less strongly by genetic factors. Although the underlying genetic factors do segregate in pedigrees in Mendelian fashion, the segregation is concealed by several complications. First, environmental effects on the trait may be strong enough to mask the genetic segregation. Second, genetic effects on many traits are determined by the joint effects of the alleles of two or more genes, and the segregation of any one gene in a pedigree may be obscured by the segregation of others.

On the other hand, some phenotypic diversity in populations does show simple Mendelian segregation. In the snapdragon Antirrhinum majus, for example, whether the flower color is red, pink, or white is determined by the alleles I and i of a single gene. The genotypes II, Ii, and ii have red, pink, and white flowers, respectively, an example of incomplete dominance.
Populations containing both the I and i alleles will include plants whose flowers are red (II), pink (Ii), or white (ii) in proportions determined by the allele frequencies of the I and i alleles in the population as well as by the manner in which the alleles are united in fertilization. By the allele frequency of a specified allele, we mean the proportion of all alleles of the gene that are of the specified type. To take a hypothetical example, suppose 400 members of a population were classified as to flower color and the finding was: 165 red, 190 pink, and 45 white. Because the flower color reveals the genotype, we may infer that the sample of 400 includes 165 II, 190 Ii, and 45 ii genotypes. The observed numbers of I and i alleles are therefore

I: 2 x 165 + 190 = 520
i: 190 + 2 x 45 = 280

The factors of 2 are included for the homozygous genotypes because each II genotype contains two I alleles and each ii genotype contains two i alleles. The total number of alleles in the sample equals 2 x 400 = 800. Therefore, if we let p represent the frequency of the I allele and q represent the frequency of the i allele (with p + q = 1 because these are the only alleles of the gene in question), then we can estimate p and q from the observations as:

p̂ = 520/800 = 0.65
q̂ = 280/800 = 0.35

Note that, if the I and i alleles were combined into genotypes at random, the expected frequencies of the three genotypes can be calculated from the rule for repeated trials by expanding the binomial $(p\,I + q\,i)^2 = p^2\,II + 2pq\,Ii + q^2\,ii$. Therefore, assuming random combination into genotypes, the expected numbers of the three genotypes are:

II: (0.65)^2 x 400 = 169
Ii: 2 x 0.65 x 0.35 x 400 = 182
ii: (0.35)^2 x 400 = 49

Hence, the observed numbers in this hypothetical population are very close to those expected with random combinations of alleles.
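The allele counting and the expected genotype numbers in the snapdragon example can be reproduced in a few lines. This is only a sketch of the arithmetic shown above; the function names `allele_frequencies` and `expected_counts` are hypothetical, and the counts are the hypothetical sample of 165 red, 190 pink, and 45 white plants.

```python
def allele_frequencies(n_II, n_Ii, n_ii):
    """Estimate the frequencies of I and i by counting alleles in genotypes."""
    total_alleles = 2 * (n_II + n_Ii + n_ii)
    p_hat = (2 * n_II + n_Ii) / total_alleles  # each II carries two I alleles
    q_hat = (n_Ii + 2 * n_ii) / total_alleles  # each ii carries two i alleles
    return p_hat, q_hat


def expected_counts(p, q, n):
    """Expected numbers of II, Ii, and ii if alleles combine at random."""
    return p * p * n, 2 * p * q * n, q * q * n


observed = (165, 190, 45)
p_hat, q_hat = allele_frequencies(*observed)
expected = expected_counts(p_hat, q_hat, sum(observed))

print("estimated frequencies:", round(p_hat, 2), round(q_hat, 2))
for genotype, obs, exp in zip(("II", "Ii", "ii"), observed, expected):
    print(f"{genotype}: observed {obs}, expected {exp:.0f}")
```

The same two functions can be applied to any other sample of the three genotypes by changing the tuple `observed`.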
The proportions $p^2$, $2pq$, and $q^2$ for the three genotypes when two alleles are combined at random constitute the Hardy-Weinberg principle, which is one of the basic principles in population genetics. The Hardy-Weinberg principle is discussed in detail in Chapter 2.

PROBLEM 1.3 Suppose that a random sample of 400 snapdragons from a population includes 185 red, 150 pink, and 65 white. Estimate the allele frequency $p$ of I and $q$ of i. Assuming random combinations of alleles in the genotypes, what are the expected numbers of the three genotypes? Do the observed data seem to fit the expectations?

ANSWER Among the total of 800 alleles, the observed number of I alleles is 2 × 185 + 150 = 520 and that of i alleles is 150 + 2 × 65 = 280. Therefore, $\hat{p}$ = 520/800 = 0.65 and $\hat{q}$ = 280/800 = 0.35. Note that the estimated allele frequencies are the same as above, even though the observed numbers of the genotypes are different. With random combinations of alleles in the genotypes, the expected numbers are again 169 red, 182 pink, and 49 white. Compared to the observations, there appear to be too many homozygous genotypes and too few heterozygous genotypes. (A statistical method for deciding whether the fit is satisfactory or not is discussed in Chapter 2.)

Parameters and Estimates

In the discussion of flower color in snapdragons, we made a subtle distinction between the actual allele frequency of the I allele (designated $p$) and the estimated allele frequency of the I allele (symbolized $\hat{p}$). The distinction is necessary whenever an experimenter makes inferences about an entire population from an examination of a random sample from the population. Quantities used in describing entire populations are parameters. In the snapdragon example, the parameter of interest is the allele frequency $p$ of I in the entire population. Because we only have access to a sample of 400 organisms from the population, the true value of $p$ is unknown.
The best we can do is make an estimate of $p$ based on a sample, hoping that the sample is representative of the population as a whole. The estimate obtained from the sample is designated $\hat{p}$ to emphasize that it is an estimate rather than the true value. In this book, whenever it is necessary to distinguish parameters from their estimates, we use unembellished symbols for parameters (for example, $p$ for the unknown frequency of an allele in a specified population) and the same symbol with a circumflex for the estimated value (in this example, $\hat{p}$).

The Standard Error of an Estimate

The distinction between a parameter and an estimate is important because different samples may yield different values of the estimate for the same reason that different sibships may yield different segregation ratios, namely, chance variation from one repeated trial to the next. The estimation of an allele frequency can be treated as repeated trials by supposing that the alleles are sampled at random, one by one, from a very large population. In the snapdragon example, there are 800 alleles sampled. If the allele frequency of I has the true value $p = 0.65$, then the repeated-trials interpretation implies that all possible outcomes of 800 trials have probabilities given by successive terms in the expansion of $(0.65\,I + 0.35\,i)^{800}$. This is not an expansion that one would want to do by hand, but the binomial expression makes evident the underlying random-sampling process that accounts for variation in the estimate of $p$ from one sample of 800 alleles to the next.

Unless $p$ is quite close to 0 or quite close to 1, there is a convenient approximation to the binomial expansion $(p\,I + q\,i)^n$, where $n$ is the number of alleles sampled. As $n$ becomes large, the distribution of $\hat{p}$ approaches the familiar bell-shaped curve called the normal distribution.
The normal distribution features prominently in the analysis of traits determined jointly by multiple genetic and environmental factors and it is discussed in detail in that context (Chapter 9). For present purposes, it is sufficient to note that the degree to which the values of $\hat{p}$ are clustered around the overall average depends on a quantity called the standard error:

$$s = \sqrt{\frac{pq}{n}} \qquad (1.3)$$

where $q = 1 - p$. If the sampling and estimation of $p$ were repeated many times using the same population, then the values of $\hat{p}$ would be expected to be clustered symmetrically around $p$ according to the standard error as follows:

• Approximately 68% of the estimates $\hat{p}$ lie within plus or minus one standard error of $p$.
• Approximately 95% of the estimates $\hat{p}$ lie within two standard errors of $p$.
• Approximately 99.7% of the estimates $\hat{p}$ lie within three standard errors of $p$.

To put the matter in another way, with repeated sampling, 32% of the estimates would be expected to differ from the true value by more than one standard error, 5% by more than two standard errors, and only 0.3% by more than three standard errors.

As an illustration of the variation among repeated estimates of $p$, Figure 1.6 shows the values of $\hat{p}$ obtained in 100 repetitions of the experiment of sampling 800 alleles from a large population in which the true allele frequency is $p = 0.65$. Each of the 100 samples was created by computer simulation using a random-number generator that yielded a 1 with probability 0.65 and a 0 with probability 0.35.
For each sample of 800, therefore, the estimate $\hat{p}$ equals the number of 1s in the sample divided by 800. As is evident in Figure 1.6, the distribution of $\hat{p}$ values is more or less bell-shaped but not exactly so because it is based on only 100 samples rather than an infinite number.

Figure 1.6 Estimates of allele frequency based on 100 samples, each of size 400 diploid organisms, from a population in which the actual allele frequency is 0.65. The standard error equals 0.017, and the distribution of the estimates is very close to the bell-shaped distribution expected theoretically. The scale across the top gives the ranges of the estimates as multiples of the standard error. (Horizontal axis: allele frequency estimate; top scale: deviation from mean in standard errors, from -3 to +3.)

The overall mean $\hat{p}$ from all 100 samples combined (80,000 observations) equals 0.6492, which is very close to the true value of $p$. Furthermore, the distribution of the estimates fits the predictions based on the standard error quite well. To apply Equation 1.3 to the data in Figure 1.6, note first that $p = 0.65$ with $n = 800$, and so $s$ in Equation 1.3 equals $\sqrt{(0.65 \times 0.35)/800} = 0.017$. Because 68% of the samples are expected to yield values of $\hat{p}$ in the range $p \pm s$, and because the expected distribution is symmetrical, 34 of the values in Figure 1.6 are expected in the range $p - s$ to $p$ (0.633 to 0.650) and 34 in the range $p$ to $p + s$ (0.650 to 0.667); the actual numbers are 33 in the first interval and 35 in the second.
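A simulation of the kind described for Figure 1.6 can be sketched as follows. This is not the program used to produce the figure; it is a hedged illustration in which the function names, the seed, and the use of Python's `random` module are all assumptions. Because the random numbers differ, the counts it prints will not match the 33 and 35 reported above.

```python
import math
import random


def standard_error(p, n_alleles):
    """Equation 1.3: s = sqrt(p * q / n), with q = 1 - p."""
    return math.sqrt(p * (1 - p) / n_alleles)


def simulate_estimates(p, n_alleles, n_samples, seed=0):
    """Draw n_samples samples of n_alleles alleles and return each estimate.

    Every allele is a 1 (allele I) with probability p and a 0 otherwise;
    the estimate for a sample is the number of 1s divided by n_alleles.
    """
    rng = random.Random(seed)
    estimates = []
    for _sample in range(n_samples):
        ones = sum(1 for _allele in range(n_alleles) if rng.random() < p)
        estimates.append(ones / n_alleles)
    return estimates


p_true, n_alleles = 0.65, 800
s = standard_error(p_true, n_alleles)
estimates = simulate_estimates(p_true, n_alleles, n_samples=100, seed=1)

mean_estimate = sum(estimates) / len(estimates)
within_1s = sum(abs(e - p_true) <= s for e in estimates)
within_2s = sum(abs(e - p_true) <= 2 * s for e in estimates)

print(f"standard error s = {s:.3f}")
print(f"mean of estimates = {mean_estimate:.4f}")
print(f"within 1 s: {within_1s} of 100; within 2 s: {within_2s} of 100")
```

Changing the `seed` argument gives a different set of 100 samples, which is a quick way to see how much the counts in each interval fluctuate around the theoretical 68% and 95%.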
By the same reasoning, 95% of the values should lie in the range $p \pm 2s$, or 47.5% on each side of the mean; because 34% of the values on either side of the mean are in the range $p \pm s$, the implication is that 47.5 - 34 or 13.5% of the values should lie in the range $p - 2s$ to $p - s$ and 13.5% should lie in the range $p + s$ to $p + 2s$. For the data in Figure 1.6, these ranges are 0.616 to 0.633 and 0.667 to 0.684; the actual number in each interval is 18 and 10, respectively, as against the theoretical 13.5 in each. Likewise, the standard error predicts that 0.3% of the samples will deviate by more than $3s$ from the mean, as compared with the observed 2.

Estimates and their standard errors are often presented as $\hat{p} \pm s$, or 0.65 ± 0.017 in the present example. The 68%, 95%, and 99.7% cutoffs for ±1, ±2, and ±3 standard errors provide one manner in which the reliability of an estimate may be interpreted. Estimates may also be presented alternatively in terms of a range called a confidence interval, which expresses a degree of confidence that the true value of a parameter lies in some specified interval. The most frequently encountered confidence interval is the 95% confidence interval, defined as the interval $(\hat{p} - 2s,\ \hat{p} + 2s)$. Because 95% of repeated samples are expected to yield estimates in a range $\pm 2s$ around the true mean, then 95% of the time the interval $(\hat{p} - 2s,\ \hat{p} + 2s)$ is expected to include the true value of the parameter $p$. In the snapdragon example with $\hat{p} = 0.65$ and $s = 0.017$, the 95% confidence interval is 0.616 to 0.684.

PROBLEM 1.4 The MN blood groups in human beings are determined by two alleles of a single gene, designated M and N. Each allele results in the production of a different type of polysaccharide molecule on the surface of red blood cells, which can be distinguished by means of appropriate chemical reagents. The types of molecules corresponding to the M and N alleles are designated M and N, respectively. The M and N alleles are codominant; that is, genotype MM produces only the M substance and has blood group M, genotype NN produces only the N substance and has blood group N, and the heterozygous genotype MN produces both the M and N substances and has blood group MN. Among a sample of 1000 British people (Race and Sanger 1975), the observed numbers of each blood group were 298 M, 489 MN, and 213 N. Using these data, estimate the allele frequency $p$ of the M allele and calculate its standard error. What are the 68%, 95%, and 99.7% confidence intervals for $p$?

ANSWER Because each genotype has a unique phenotype, the sample contains 2 × 298 + 489 = 1085 M alleles, and so $\hat{p}$ = 1085/2000 = 0.5425. The standard error $s = \sqrt{(0.5425)(1 - 0.5425)/2000} = 0.0111$. The 68%, 95%, and 99.7% confidence intervals for $p$ are $\hat{p} \pm 1s$, $2s$, and $3s$, respectively, and so the confidence intervals are 0.5314 to 0.5536 (68%), 0.5202 to 0.5647 (95%), and 0.5092 to 0.5758 (99.7%).

MODELS IN POPULATION GENETICS

Population geneticists must contend with factors such as population size, patterns of mating, geographical distribution of organisms, mutation, migration, and natural selection. Although we wish ultimately to understand the combined effects of all these factors and more, the factors are so numerous and interact in such complex ways that they cannot usually be grasped all at once. Simpler situations are therefore devised, situations in which a few identifiable factors are the most important ones and others can be neglected. An intentional simplification of a complex situation is a model. There are several types of models, each designed to eliminate extraneous detail in order to focus attention on the essentials. Some models are experimental. An experimental model may consist of a laboratory experiment with population cages of Drosophila or growing cultures of bacteria.
An experimental model may also consist of observations of natural populations in particular locations or at particular times in which evolutionary forces of interest may be presumed to be present. Models of this type include the study of the origin and spread of insecticide resistance in insects or antibiotic resistance in bacteria.

A model may also be a conceptual simplification. Conceptual models have a number of uses. They require a concise statement of presumed mechanisms and interactions; they afford a framework for interpreting observations and setting research priorities; they enable extrapolation into the future or beyond the range of known parameters; and they suggest tests of consistency between theory and observation. A conceptual model may consist of verbal arguments logically linking a chain of hypotheses and deductions. Another type of conceptual model is a computer program that simulates the random component in a process or that calculates the values of changing quantities in a complex system based on prescribed numerical relations. An example of a computer model is the one for examining the result of repeated random sampling whose outcome is depicted in Figure 1.6. In population genetics, a kind of model frequently encountered is a mathematical model, which is a set of hypotheses that specifies the mathematical relations between measured or measurable quantities (the parameters) in a system or process. Mathematical models can be extremely useful:

• They express concisely the hypothesized quantitative relationships between parameters.
• They reveal which parameters are the most important in a system and thereby suggest critical experiments or observations.
• They serve as guides to the collection, organization, and interpretation of observed data.
• They make quantitative predictions about the behavior of a system that can, within limits, be confirmed or shown to be false.

The validity of any model must be tested by determining whether the hypotheses on which it is based and the predictions that grow out of it are consistent with observations.

A mathematical model is always simpler than the actual situation it is designed to elucidate. A model is supposed to be simple. If it is not simpler than the real situation, then it isn't a model. Models are simpler than real situations because many features of real life are intentionally ignored. To include every aspect of a complex system would make a model too complex and unwieldy. Construction of a model always requires a compromise between realism and manageability. A completely realistic model is likely to be too complex to handle mathematically, and a model that is mathematically simple may be so unrealistic as to be useless. Ideally, a model should include all essential features of the system and exclude all nonessential ones. How good or useful a model is often depends on how closely this ideal is approximated. In short, a model is a sort of metaphor or analogy. Like all analogies, it is valid only within certain limits but, when pushed beyond these limits, becomes misleading or even absurd.

In this book, we are going to take many liberties with mathematical rigor. Our excuse is that the basic ideas of a model are often obscured rather than illuminated by excessive attention to mathematical detail. Our authority for the approach is the great physicist Richard Feynman, who wrote in one of his papers:

Mathematicians may be completely repelled by the liberties taken here. The liberties are taken not because the mathematical problems are considered unimportant. On the contrary, [I hope] to encourage the study of these forms from a mathematical standpoint. In the meantime, just as a poet has a license from the rules of grammar and pronunciation, we should like to ask for "physicists' license" from the rules of mathematics in order to express what we wish to say in as simple a manner as possible.
Exponential Population Growth

To illustrate the nature of mathematical models (as well as some of their limitations) we consider the dynamics of population growth, a subject of considerable interest in population genetics and population biology. In Figure 1.7, the solid dots show the increase in the number of cells of the yeast Saccharomyces cerevisiae in a defined quantity of culture medium. The number of cells increases slowly at first (0 to 4 hours), then more rapidly (hours 4 to 12), then more slowly again (hours 12 to 18). As a first approximation of the early stages of population growth, we may assume that a constant fraction of the cells reproduces in each interval of time.

Figure 1.7 Increase in the number of cells of the yeast Saccharomyces cerevisiae in a defined quantity of culture medium (dots). The smooth curves are made from mathematical models of exponential growth or logistic growth. (Data from Pearl 1927.)

To simplify matters further, we will assume that the population size does not change gradually but changes in a discrete and instantaneous "jump" at the end of each hour. A model of this type is a discrete model of population growth. Thus, we may write

$$N_t = N_{t-1} + rN_{t-1} \qquad (1.4)$$

where $N_t$ and $N_{t-1}$ represent population size at the end of hours $t$ and $t - 1$ and where $r$ is a constant called the intrinsic rate of increase, equal to the fraction of cells that reproduce in each interval of time. This equation says that the population size at the end of hour $t$ is the sum of two components: (1) all the cells present at the end of hour $t - 1$ (which means that none of the cells die), and (2) the progeny of the $rN_{t-1}$ cells that divided in the interval.

Equation 1.4 illustrates a feature of theoretical population genetics that sometimes leads to confusion: the same symbols are often used for different things. In this equation, $r$ is the intrinsic rate of increase in population number.
In other equations in population genetics, $r$ is the recombination fraction between two genes linked in the same chromosome. The symbol $r$ is used for still other parameters also. Any possible confusion could be avoided by indicating each parameter with a different letter; this solution is impractical because one quickly runs out of letters, even including Greek letters. Another way is to distinguish different meanings of the same letter by typography, the use of superscripts, subscripts, and so forth. The problem with this approach is that even simple equations get to look imposing. Still another solution, which is the one adopted in this book, is to ask the reader to pay close attention to the context so that, for example, $r$ as used in the context of population growth is not confused with $r$ used in the context of genetic linkage and recombination.

The solution to Equation 1.4 is straightforward. Because $N_t = (1 + r)N_{t-1}$, it follows that $N_{t-1} = (1 + r)N_{t-2}$. Consequently, we can write $N_t = (1 + r)(1 + r)N_{t-2} = (1 + r)^2 N_{t-2}$. However, $N_{t-2} = (1 + r)N_{t-3}$, and so $N_t = (1 + r)^3 N_{t-3}$. Continuing in this manner, we eventually deduce that

$$N_t = (1 + r)^t N_0 \qquad (1.5)$$

For the data in Figure 1.7, if we set $N_0 = 10$ (the observed number) and $r = 0.7083$, the first few points from Equation 1.5 (indicated by crosses) fit very well: $N_0 = 10$, $N_1 = 17$, $N_2 = 29$, $N_3 = 50$. Then the model starts to break down: $N_4 = 85$, $N_5 = 145$, $N_6 = 249$, and thereafter the fit becomes very bad indeed.

The lesson from this example is that many models have a range over which they are reasonable approximations to the real world, in this case, for a short time after a yeast culture is inoculated. If the model is extrapolated beyond its range of validity, it yields nonsense. The problem for many models in population genetics is that their range of validity is unknown.
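The predicted values quoted above can be checked by iterating Equation 1.4 directly and comparing the result with the closed form of Equation 1.5. The sketch below assumes only the two values given in the text, $N_0 = 10$ and $r = 0.7083$; the function name `discrete_growth` is an illustrative choice.

```python
def discrete_growth(n0, r, hours):
    """Iterate Equation 1.4, N_t = N_(t-1) + r * N_(t-1), hour by hour."""
    sizes = [n0]
    for _hour in range(hours):
        previous = sizes[-1]
        sizes.append(previous + r * previous)
    return sizes


n0, r = 10, 0.7083
sizes = discrete_growth(n0, r, hours=6)

# Closed-form solution, Equation 1.5: N_t = (1 + r)**t * N_0
closed_form = [n0 * (1 + r) ** t for t in range(7)]

for t, (iterated, solved) in enumerate(zip(sizes, closed_form)):
    print(f"t = {t}: iterated {iterated:8.2f}   closed form {solved:8.2f}")
```

Rounded to whole cells, the iterated values are 10, 17, 29, 50, 85, 145, and 249, matching the numbers given for Equation 1.5.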
In Equation 1.5, $N_t$ is defined only for $t$ equal to positive integers because of the discrete nature of the model. Population growth is actually a continuous process. Population size increases gradually rather than in jumps. The continuous-growth version of Equation 1.5, shown by the dashed line labeled "exponential curve" in Figure 1.7, is given by

$$N(t) = N(0)e^{r_0 t} \qquad (1.6)$$

where $r_0 = \ln(1 + r)$. The rationale for Equation 1.6 is based on the same sort of argument as Equation 1.4 but compressing the time scale. Whereas Equation 1.4 assumes that each unit of time is one hour, suppose that each time unit were, say, one minute. In changing the time scale in this manner, we must also decrease the value of $r$, otherwise too many organisms would reproduce in each unit of time. Therefore, by analogy with Equation 1.4, we can write $N_t - N_{t-1} = r_0 N_{t-1}$, but here $r_0$ is the intrinsic rate of increase in the new time scale. If $N(t)$ is a smooth, continuous function and not changing too fast, then it is easy to convince yourself that $N_t - N_{t-1}$ should approximate the derivative of $N(t)$, which is the change in $N(t)$ in a small interval of time, and that $N_{t-1}$ should be close to $N(t)$ because we have assumed that $N(t)$ is not changing very fast in the new time scale. Therefore, we can write

$$\frac{dN(t)}{dt} = r_0 N(t) \qquad (1.7)$$

or

$$\frac{dN(t)}{N(t)\,dt} = r_0 \qquad (1.8)$$

Because $d\ln N(t)/dt = dN(t)/[N(t)\,dt]$, where ln denotes the natural logarithm (the logarithm to base $e$), the solution of Equation 1.8 is $\ln N(t) = r_0 t + C$, where $C$ is a constant chosen so that $N(t) = N(0)$ when $t = 0$. (Hence, $C = \ln N(0)$.) Expressing the solution in terms of $N(t)$ rather than $\ln N(t)$ yields Equation 1.6. Furthermore, comparing Equation 1.6 with Equation 1.5, it is clear that

$$N(0)e^{r_0 t} = (1 + r)^t N_0 \qquad (1.9)$$

and therefore $r_0 = \ln(1 + r)$ is the relation between the parameter $r_0$ in the continuous model and the parameter $r$ in the discrete model.
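The relation $r_0 = \ln(1 + r)$ is easy to confirm numerically. The short sketch below uses the discrete-model value $r = 0.7083$ and $N_0 = 10$ from the yeast example and shows that the continuous curve of Equation 1.6 passes through the same points as Equation 1.5 at whole hours; the variable names are illustrative.

```python
import math

r = 0.7083               # intrinsic rate of increase in the discrete model
r0 = math.log(1 + r)     # corresponding rate in the continuous model
n0 = 10

print(f"r0 = ln(1 + r) = {r0:.4f}")

# At integer times the two models agree, as Equation 1.9 states.
for t in range(5):
    discrete = n0 * (1 + r) ** t          # Equation 1.5
    continuous = n0 * math.exp(r0 * t)    # Equation 1.6
    print(f"t = {t}: discrete {discrete:7.2f}   continuous {continuous:7.2f}")
```

Between whole hours only the continuous model is defined, which is why it can be drawn as a smooth curve.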
Equation 1.6 is the exponential function plotted in Figure 1.7 with $N(0) = 10$ and $r_0 = 0.5355$.

PROBLEM 1.5 Under optimal culture conditions, the bacterium Escherichia coli can double in population size every 20 minutes. Because population growth is continuous, Equation 1.7 is the appropriate model. A single cell of E. coli is cylindrical in shape and has a volume of approximately 1.6 μm³ ($1.6 \times 10^{-12}$ cm³). A standard soccer ball has a diameter of 22 cm (roughly 9 inches) and a volume of approximately 5600 cm³.
(a) What intrinsic rate of increase $r_0$ per minute results in a doubling time of 20 minutes?
(b) Starting with a single cell of E. coli growing under optimal conditions, how long would it take to produce enough cells to fill one soccer ball?
(c) How many soccer balls could be filled with cells after 24 hours of unrestricted growth?

ANSWER (a) Set $N(20) = 2N(0) = N(0)\exp(r_0 \times 20)$, where $\exp(\cdot)$ stands for $e^{(\cdot)}$. Therefore, $r_0 = (\ln 2)/20 = 0.034657$. (b) One soccer ball full of cells equals $5600/(1.6 \times 10^{-12}) = 3.5 \times 10^{15}$ cells. The time needed to produce this many cells is given by $t = [\ln(3.5 \times 10^{15})]/r_0 = 1032.7$ minutes (17.2 hours). (c) After 24 hours (1440 minutes) of unrestricted growth, one cell yields $\exp(r_0 \times 1440) = 4.7 \times 10^{21}$ cells, which would fill about 1.35 million soccer balls. (Note: If your answers to this problem are a little different from those given, it is probably because the numbers given were calculated to nine significant digits before rounding off.)

Logistic Population Growth

The calculations in Problem 1.5 indicate that no real population can grow exponentially for more than a relatively small number of generations without catastrophic consequences.
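The arithmetic in Problem 1.5 can be reproduced with the short sketch below, which shows how quickly exponential growth becomes absurd. It uses only the figures stated in the problem (a 20-minute doubling time, a cell volume of $1.6 \times 10^{-12}$ cm³, and a ball volume of 5600 cm³); the variable names are illustrative.

```python
import math

doubling_time_min = 20
cell_volume_cm3 = 1.6e-12
ball_volume_cm3 = 5600

# (a) Rate per minute that doubles the population in 20 minutes.
r0 = math.log(2) / doubling_time_min

# (b) Cells needed to fill one ball, and the time one cell needs to get there.
cells_per_ball = ball_volume_cm3 / cell_volume_cm3
minutes_to_fill = math.log(cells_per_ball) / r0

# (c) Descendants of one cell after 24 hours, expressed in soccer balls.
cells_after_24h = math.exp(r0 * 24 * 60)
balls_after_24h = cells_after_24h / cells_per_ball

print(f"(a) r0 = {r0:.6f} per minute")
print(f"(b) {cells_per_ball:.2e} cells, reached after {minutes_to_fill:.1f} minutes")
print(f"(c) {cells_after_24h:.2e} cells, or {balls_after_24h:.3e} soccer balls")
```

Since 1440 minutes is exactly 72 doubling times, part (c) is simply $2^{72}$ cells, a convenient cross-check on the exponential calculation.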
In nature, although factors such as disease and predation often contribute to the control of population size, populations that grow too large ultimately must deplete the available resources. The kind of growth curve in Figure 1.7 is typical for populations expanding in a new environment: the initial population growth is exponential, but then the rate of growth gradually decreases. A simple alternative to exponential growth is the logistic model; the term logistic refers to proportions and, in the logistic model, the rate of population growth is assumed to decrease in proportion to the population size. By analogy with Equation 1.4, the change in population size with a discrete model of population growth takes the form

$$N_t = N_{t-1} + rN_{t-1}\left(\frac{K - N_{t-1}}{K}\right) \qquad (1.10)$$

In this equation, $K$ is a constant known as the carrying capacity of the environment. Observe that, when $N$ is very small compared with $K$, then $N_t \approx N_{t-1} + rN_{t-1}$, and so population growth is nearly exponential. On the other hand, when $N$ is close to $K$, then $N_t \approx N_{t-1}$, and so population growth comes to a standstill. Unlike Equation 1.4, Equation 1.10 does not have a simple solution for $N_t$ in terms of $N_0$. However, if the population grows sufficiently slowly then population growth can be treated as continuous, and Equation 1.10 yields the differential equation

$$\frac{dN(t)}{dt} = r_0 N(t)\left(\frac{K - N(t)}{K}\right) \qquad (1.11)$$

The solution of Equation 1.11 is given by

$$N(t) = \frac{K}{1 + Ce^{-r_0 t}} \qquad (1.12)$$

where the constant $C = (K - N_0)/N_0$. Equation 1.12 is called the logistic growth curve and it is derived in Problem 1.7 below. Logistic population growth results in a sort of S-shaped curve like that shown in Figure 1.7, where the parameters are $r_0 = 0.5355$, $N_0 = 10$, and $K = 665$. (Note that the $r_0$ and $N_0$ parameters are the same as in the exponential-growth model for the same data.) The fit is obviously very good indeed.

PROBLEM 1.6 Use Equation 1.12 with $N_0 = 10$, $r_0 = 0.5355$, and $K = 665$ to calculate $N(t)$ for the times $t = 7$ and 8 and $t = 13$ and 14. What are the values of $r$ in Equation 1.10 for $t = 8$ and $t = 14$? Why are they not equal to 0.5355? Why are they not equal to each other?

ANSWER With the given parameters, $N(7) = 261.53$, $N(8) = 349.43$, $N(13) = 626.13$, and $N(14) = 641.68$. Solving Equation 1.10 for $r$ and substituting $N(t)$ yields $r = 0.5540$ for $t = 8$ and $r = 0.425$ for $t = 14$. Neither of these values agrees with $r_0 = 0.5355$, nor do they agree with each other, because Equation 1.10 pertains to a discrete model and Equation 1.12 to a continuous model. When the population grows continuously, the value of $r$ needed to produce a given change in population size in some discrete interval of time depends on the magnitude of the change in population size.

PROBLEM 1.7 Use the expression $\int dx/[x(a + bx)] = -(1/a)\ln[(a + bx)/x]$ to derive the logistic growth curve from Equation 1.11.

ANSWER Write Equation 1.11 as $dN(t)/\{N(t)[K - N(t)]\} = (r_0/K)\,dt$. Comparing with the integral form, it is clear that $a = K$ and $b = -1$. Integrating both sides in accordance with the formula results in $-(1/K)\ln\{[K - N(t)]/N(t)\} = r_0 t/K + \text{const}$, where const is a constant of integration chosen so that $N(t) = N(0)$ when $t = 0$. Hence, $\text{const} = -(1/K)\ln\{[K - N(0)]/N(0)\} = -(1/K)\ln C$, where $C$ is the constant appearing in Equation 1.12. Consequently, $\ln\{[K - N(t)]/N(t)\} = -r_0 t + \ln C$, and so $[K - N(t)]/N(t) = C\exp(-r_0 t)$. Equation 1.12 follows after some simplification.

SUMMARY

Population genetics is the application of Mendel's laws and other genetic principles to entire populations of organisms. It includes the study of genetic variation within and between species and attempts to understand the processes resulting in adaptive evolutionary changes in species through time. Population genetics has many practical applications in medicine, agriculture, conservation, and other fields.
A gene is a hereditary determinant transmitted from parent to offspring that influences a hereditary trait, often in combination with other genes and also with the environment. Alleles are alternative forms of a gene. Genotypes are formed from pairs of alleles and are either homozygous (if the alleles in the genotype are the same) or heterozygous (if the alleles are different). The physical or biochemical characteristics of an organism constitute its phenotype. The essential mechanism of genetic transmission was established in experiments by Gregor Mendel in the years 1856 to 1863. Mendel showed that the alleles of each gene separate (segregate) from one another in the formation of reproductive cells or gametes. Genes are arranged in linear order along chromosomes. A chromosome may contain several thousand genes. Alleles of different genes present in the same chromosome tend to be inherited together (linkage), but the allele combinations can be broken up by recombination.

Chemically, a gene is a region of a DNA molecule. DNA is a metaphorical "twisted ladder" consisting of two paired strands composed of polymers of nucleotides (the sidepieces of the ladder) whose bases (either A, T, G, or C) jut inward from the sidepieces to form the rungs. Each rung of the ladder consists of either an A-T base pair or a G-C base pair. Most genes code for the polypeptide chains of proteins through a transcript of RNA that is processed into the messenger RNA (mRNA). The polypeptide is produced stepwise by translation of the mRNA according to a triplet genetic code, in which each nonoverlapping group of three adjacent bases (a codon) specifies the amino acid to be attached to the growing chain.
Alleles differ in their sequence of nucleotides. A nucleotide substitution in the third position of a codon may not result in an amino acid replacement in the encoded polypeptide because of redundancy in the genetic code. However, most nucleotide substitutions in either of the first two positions do result in amino acid replacements.

A probability is a number between 0 and 1 that measures the likelihood of a particular event being realized in an actual or conceptual experiment. The addition rule applies to mutually exclusive events and states that the probability of one or the other event being realized equals the sum of the separate probabilities. The multiplication rule applies to independent events and states that the probability of both events being realized simultaneously equals the product of the separate probabilities. The probabilities of various outcomes of repeated and independent trials can be deduced by application of the addition and multiplication rules and conform to successive terms in the binomial expansion $(p + q)^n$.

Natural populations contain genetic variation in the form of multiple alleles of many genes. For any specified allele, the allele frequency is the proportion of all alleles of the gene that are of the specified type. The allele frequency in a population must usually be estimated from a sample, and so there is variation in the estimate from one sample to the next. The variation is quantified by the standard error. If the distribution of the estimates conforms to a normal, bell-shaped distribution, then the proportions of the estimates lying within ±1, ±2, and ±3 standard deviations of the true value of the parameter are 68%, 95%, and 99.7%, respectively. Estimates are also often presented as a confidence interval, which expresses the degree of confidence that the true value of a parameter lies in some specified interval.
A model is a deliberate simplification of a complex situation. Models may be experimental or conceptual. Conceptual models may be verbal, computational, or mathematical. Mathematical models are widely used in population genetics. They specify the mathematical relations between measured or measurable quantities that determine the changes in allele frequency in populations. Population growth affords an example of mathematical modeling. In the simplest model of discrete population growth, at discrete times a constant fraction of the population reproduces, and so the population jumps instantaneously from one size to the next. A more realistic model envisages continuous reproduction through time, in which case population growth is exponential. The exponential model often fits population growth in newly colonized environments when the population density is low. Population growth is ultimately limited by nutrients, space, or other resources. When population growth decreases in proportion to population size, the S-shaped logistic curve of population growth results; this curve is determined by the intrinsic rate of increase $r$ and the carrying capacity of the environment $K$.

PROBLEMS

1. If you were to catch a collection of Drosophila, grind each one individually in a buffer solution, and measure the rate at which this crude whole-fly homogenate catalyzed the reaction for glucose-6-phosphate dehydrogenase, you would find that the activities would vary by more than fourfold. Make a list of possible causes of this variation.

2. Given the complexity of causes of variation in Problem 1, how much variation would you expect to see in the underlying genetic cause of a human inborn error of metabolism such as phenylketonuria? This disorder is caused by insufficient activity of phenylalanine hydroxylase.
3 There are 64 codons in the genetic code, and each codon can undergo nine single-site mutations (each base can mutate to three other bases), for a total of 576 mutations. How many of these result in no change in (he "meaning" of the encoded sequence? 4 Assuming that all nucleotides in all codons mutate with equal Irequency (i e , that all 576 mutations in Problem 3 occur at the same rate), are muta- tions from one ammo acid to another all equally likely? 5. The correspondence between genotype and phenotype is one of the most complex and difficult aspects of evolutionary genetics. Describe an exam- ple of a gene whose mutations cause more than one distinctly different phenotype that do not appear lo be related. 6. A population cage of Drosophila melm-wgaster is started with 50 males and 50 females, all having the genotype (e sf)/(c + si') This notation implies that one chromosome has the e and sr mutations, and the other has the wild type allele at both loci. These two loci show a frequency of recombi- nation in females of r = 037, and the males produce only non-recombi- nant gametes, Calculate the expected frequency of the gametes for both males and females and the expected offspring genotype frequencies 7. In some human cultures it is very important to have n son and a daugh- ter, and couples continue having offspring until they have one of each. If an entire population followed this rule, what would happen to the sex ratio in the population? 8. If two genes are on different chromosomes, the probability that a gamete has a particular allele of each of the two genes is the product of the prob- ability of drawing each allele because the draws are independent of one another (see the multiplication rule). If each gene is on a dilferent chro- mosome, what is the chance that genotype An Bb CC Dd produces two consecutive gametes that are A BCD? 9. 
If individual X has an autosomal recessive disease and both parents are unaffected, what is the chance that the sibling of X is a heterozygous carrier?

10. A line of mice seems to consistently produce 55% male and 45% female offspring. In order to test whether this deviation is significant, how many offspring would you have to count to be able to reject a 50 : 50 sex ratio at a probability of α = 0.05? (Assume that the sex ratio of the mice remains 55 : 45.)

11. A species of butterflies occurs in two distinct morphs, A and B. You sample two areas and count 26 A and 28 B butterflies in one area, and 10 A and 21 B in another area. Is it possible that these two samples could come from a single homogeneous population, or are the frequencies of the two morphs significantly different from one another?

12. Levy and Levin (1975) used electrophoresis to study the phosphoglucose isomerase-2 gene in the evening primrose Oenothera biennis, a complex genomic heterozygote made true breeding by chromosomal translocations. They observed two alleles affecting electrophoretic mobility of the enzyme, and among 57 strains they found 35 PGI-2a/PGI-2a, 19 PGI-2a/PGI-2b, and 3 PGI-2b/PGI-2b genotypes.
a. Calculate the allele frequencies of PGI-2a and PGI-2b.
b. With random mating, what would be the expected numbers?

13. The simple models of population growth fail to take into account many factors that affect rates of change. The global human population at 1 A.D., 200 A.D., and at intervals of 200 years up to the present has been estimated in millions of people as 200, 200, 200, 200, 250, 280, 350, 400, 550, 980, and 6000. If the population were growing exponentially, these points would fall on a straight line when plotted on a logarithmic scale. Draw this plot. What do you conclude?

14. A healthy pair of Drosophila can produce 500 offspring in 12 days, each adult fly weighing about 1 mg. Assume that the parental flies die after they finish reproducing.
(Actually, they live about a month.) If all successive generations get enough to eat and remain this fecund, what will the mass of flies be in one year?

CHAPTER 2

Genetic and Phenotypic Variation

Phenotypic Variation • Normal Distribution • Mendelian Variation • Protein Polymorphisms • DNA Polymorphisms • Multiple-Factor Model

Genetic variation in populations became a subject of scientific inquiry in the late nineteenth century, prior even to the rediscovery of Mendel's paper in 1900. The leading exponent of the study of hereditary differences among human beings was Francis Galton (1822-1911). Galton was a pioneer in the application of statistics to biology. He used statistical methods to study physical traits such as eye color and fingerprint ridges as well as behavioral traits such as temperament and musical ability. Galton was among the first to examine the statistical relations between the distributions of phenotypic traits in successive generations. He is regarded as the founder of biometry, the application of statistics to biological problems.

PHENOTYPIC VARIATION IN NATURAL POPULATIONS

Galton and Mendel exemplify opposite approaches to the study of inherited traits. Mendel's point of departure in the study of genetics was discrete variation, in which phenotypic differences among organisms can be assigned to a small number of clearly distinct classes, such as round versus wrinkled peas. Galton's point of departure was continuous variation, in which the phenotypes of organisms are measured on a quantitative scale, like height or weight, and in which the phenotypes grade imperceptibly from one category into the next. As material for the study of phenotypic variation, Galton's choice was good: most of the differences among normal people that are visible to the unaided eye are differences in continuous traits: height, weight, skin color, hair color, facial features, running speed, shoe size, and so forth.
The same is true of phenotypic variation in other organisms. On the other hand, as material for the study of genetic variation, Mendel's choice was good: the pattern of segregation of alleles is revealed most clearly in pedigrees of discrete, simple Mendelian traits.

Continuous Variation: The Normal Distribution

With continuous traits, not only do the phenotypes grade into one another, but the traits also usually present difficulties for genetic analysis. The problems are of two principal types:

• Most continuous traits are influenced by the alleles of two or more genes, hence the segregation of any one gene in pedigrees is obscured by the segregation of other genes that affect the trait.
• Most continuous traits are influenced by environmental factors as well as by genes, and so genetic segregation is obscured by environmental effects.

These problems are not insurmountable in organisms with a sufficiently high density of genetic markers scattered throughout the genome (the complement of chromosomes) because the genetic markers can be tracked in pedigrees along with the continuous trait of interest. Organisms with sufficiently dense genetic maps include human beings, laboratory animals, and many domesticated animals and crop plants. In Galton's time, however, studies of continuous traits based on genetic linkage were unknown.

Why, then, did Galton focus on continuous traits? Because they have a sort of regularity, a statistical predictability, of their own. For many continuous traits, when the phenotypes are grouped into suitable intervals and plotted as a bar graph, the distribution of phenotypes conforms closely to the normal distribution, the symmetrical, bell-shaped curve discussed briefly in Chapter 1 in the section on phenotypic diversity and genetic variation. For example, a bar graph of Galton's data on the heights of 1329 men, rounded to the nearest inch, is plotted in Figure 2.1.
The smooth curve is the normal distribution that best fits the data. The equation of the normal curve is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2} \tag{2.1}$$

where $x$ ranges from $-\infty$ to $+\infty$, and $\pi = 3.14159$ and $e = 2.71828$ are constants. The location of the peak of the distribution along the $x$ axis is determined by the parameter $\mu$, which is the mean, or average, of the phenotypic values. The degree to which the phenotypes are clustered around the mean is determined by the parameter $\sigma^2$, which is the variance of the distribution. Mathematically, the variance is the average of the squared difference of each phenotypic value from the mean; that is, it is the average of the values of $(x - \mu)^2$. How $\mu$ and $\sigma^2$ are estimated from data is considered next.

Figure 2.1 Distribution of height among 1329 British men. Horizontal axis: height (rounded to the nearest inch), from <63 to >75. Figure annotations: N = 1329, mean = 69, standard deviation = 2.5. (Data from Galton 1889.)

Mean and Variance

Because $\mu$ and $\sigma^2$ are parameters, their values are unknown, and they must be estimated from the data themselves. The height data are tabulated in Table 2.1, in which $f_i$ is the number of men whose height is $x_i$, rounded to the nearest inch. (The fact that the shortest and tallest men are grouped in the tails of the distribution makes no difference because these men account for only a small proportion of the total sample.) Also tabulated are the products $f_i \times x_i$ and $f_i \times x_i^2$ as well as their sums. The mean $\mu$ of the distribution is estimated as the mean of the sample, which is conventionally denoted $\bar{x}$ (also sometimes as $\hat{\mu}$).
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} \tag{2.2}$$

In this example, $\bar{x} = 91{,}639/1329 = 68.95$ inches. Likewise, the variance $\sigma^2$ of the distribution is estimated as the variance of the sample, which is conventionally denoted $s^2$ (also sometimes as $\hat{\sigma}^2$):

$$s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i} = \frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2 \tag{2.3}$$

The expression in the middle follows directly from the definition of the variance: it is the average of the squared deviations from the mean because, for each value of $x_i$, $(x_i - \bar{x})$ is the deviation of that value from the mean. The expression on the right is identical arithmetically but easier to apply in practice. In the example in Table 2.1, $s^2 = 6{,}326{,}939/1329 - \bar{x}^2 = 6.11$, where $\bar{x} = 91{,}639/1329$ is used without rounding. (The result is sensitive to rounding of $\bar{x}$, so this value may differ from your own calculation according to the number of significant digits you carried along before rounding off.) If the sample size is small (say, less than 50), then a slightly better estimate of the variance is obtained by multiplying the expression in Equation 2.3 by $n/(n - 1)$, where $n$ is the total size of the sample (in this case, 1329).

Closely related to the variance is the standard deviation of the distribution, which is the square root of the variance. The standard deviation is a natural quantity to consider in view of the units of measurement. In Table 2.1, for example, each measurement is in inches. The mean is also in inches. However, the variance, being the average of squared deviations, has the units of squared inches, which seems more appropriate for an area than for a height. Taking the square root of the variance restores the correct unit of measure: in this example, inches. The estimate of the standard deviation is conventionally denoted $s$ (also sometimes as $\hat{\sigma}$), and it is calculated as the square root of the quantity in Equation 2.3.
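The arithmetic of Equations 2.2 and 2.3 can be checked with a short script. The following is a minimal sketch, not part of the original text: it uses the counts from Table 2.1, and the function name `grouped_mean_variance` is introduced here purely for illustration.

```python
import math

# Table 2.1: height rounded to the nearest inch (x_i) and number of men (f_i).
heights = [63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75]
counts = [23, 20, 64, 110, 155, 199, 203, 198, 171, 88, 47, 27, 24]


def grouped_mean_variance(x, f):
    """Return (n, sum f*x, sum f*x^2, mean, variance) for grouped data.

    The mean follows Equation 2.2; the variance uses the right-hand
    form of Equation 2.3 (mean of squares minus square of the mean).
    """
    n = sum(f)
    sum_fx = sum(fi * xi for fi, xi in zip(f, x))
    sum_fx2 = sum(fi * xi * xi for fi, xi in zip(f, x))
    mean = sum_fx / n
    variance = sum_fx2 / n - mean ** 2
    return n, sum_fx, sum_fx2, mean, variance


n, sum_fx, sum_fx2, mean, variance = grouped_mean_variance(heights, counts)
sd = math.sqrt(variance)

# Middle form of Equation 2.3: average squared deviation from the mean.
variance_middle = sum(f * (x - mean) ** 2 for f, x in zip(counts, heights)) / n

# Small-sample correction mentioned in the text: multiply by n / (n - 1).
variance_corrected = variance * n / (n - 1)

print(n, sum_fx, sum_fx2)                      # 1329 91639 6326939
print(round(mean, 2), round(variance, 2), round(sd, 2))  # 68.95 6.11 2.47
```

Because the mean is kept at full floating-point precision, the two forms of Equation 2.3 agree, and the correction factor n/(n - 1) changes the variance only slightly for a sample this large.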
In the height example, $s = 2.47$ (which may again differ slightly from your own calculation because of round-off error).

TABLE 2.1 HEIGHTS OF 1329 MEN

| Height interval ($i$) | Height range (in.) | Nearest inch ($x_i$) | Number of men ($f_i$) | $f_i \times x_i$ | $f_i \times x_i^2$ |
|---|---|---|---|---|---|
| 1 | <63.5 | 63 | 23 | 1,449 | 91,287 |
| 2 | 63.5-64.5 | 64 | 20 | 1,280 | 81,920 |
| 3 | 64.5-65.5 | 65 | 64 | 4,160 | 270,400 |
| 4 | 65.5-66.5 | 66 | 110 | 7,260 | 479,160 |
| 5 | 66.5-67.5 | 67 | 155 | 10,385 | 695,795 |
| 6 | 67.5-68.5 | 68 | 199 | 13,532 | 920,176 |
| 7 | 68.5-69.5 | 69 | 203 | 14,007 | 966,483 |
| 8 | 69.5-70.5 | 70 | 198 | 13,860 | 970,200 |
| 9 | 70.5-71.5 | 71 | 171 | 12,141 | 862,011 |
| 10 | 71.5-72.5 | 72 | 88 | 6,336 | 456,192 |
| 11 | 72.5-73.5 | 73 | 47 | 3,431 | 250,463 |
| 12 | 73.5-74.5 | 74 | 27 | 1,998 | 147,852 |
| 13 | >74.5 | 75 | 24 | 1,800 | 135,000 |
| Totals | | | 1,329 ($\sum f_i$) | 91,639 ($\sum f_i x_i$) | 6,326,939 ($\sum f_i x_i^2$) |

Source: Data from Galton 1889.

The estimate $s$ of the standard deviation is often called the standard error. When estimating a proportion, such as the frequency of an allele in a population, the standard error is calculated according to Equation 1.3 in Chapter 1.

In Chapter 1, the values 68%, 95%, and 99.7% quoted as the proportions of observations expected to fall within 1, 2, or 3 standard errors of the mean, respectively, emerge directly from Equation 2.1 for the normal distribution. In a normal distribution, the exact proportion of observations falling within any specified range of $x$ equals the integral of Equation 2.1 across the specified range. For the normal distribution, the integral between the limits $\mu \pm \sigma$ equals 0.6827, that between $\mu \pm 2\sigma$ equals 0.9545, and that between $\mu \pm 3\sigma$ equals 0.9973. In data analysis, $\bar{x}$ and $s$ are used in place of $\mu$ and $\sigma$. Incidentally, the integral of the normal distribution between the limits $\mu \pm 4\sigma$ equals 0.9999; this result says that fewer than one in 10,000 observations falls more than four standard deviations from the mean.

Central Limit Theorem

Galton was immensely impressed with the observation that many natural phenomena follow the normal distribution.
He writes:

I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "law of frequency of error" [the normal distribution]. Whenever a large sample of chaotic elements is taken in hand and marshaled in the order of their magnitude, this unexpected and most beautiful form of regularity proves to have been latent all along. The law would have been personified by the Greeks if they had known of it. It reigns with serenity and complete self-effacement amidst the wildest confusion. The larger the mob and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of unreason.

It is, indeed, remarkable to consider that pure, blind chance is the reason for this "unexpected and most beautiful form of regularity."

The theoretical basis of the normal distribution is known in probability theory as the central limit theorem. Roughly speaking, the central limit theorem states that the sum of a large number of independent random quantities always converges to the normal distribution. For our purposes, "independent" in this context means that information about any one of the observations gives no improvement in the ability to predict any other of the observations. A large number of independent random quantities is apparently what Galton meant by "a large sample of chaotic elements."

The central limit theorem explains in part why so many continuously distributed traits conform to the normal distribution. Most continuous traits are multifactorial, meaning that they are influenced by "many factors," typically several or many genes acting together with environmental factors. Among human beings, for example, the obvious differences between normal people in hair color, eye color, skin color, stature, weight, and other such traits are not usually traceable to single genes.
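As a rough illustration of the central limit theorem stated above, the following sketch sums nine independent uniform random numbers on (-1, +1), the same construction used in the example in the text, and compares the resulting distribution with the normal proportions 0.6827, 0.9545, and 0.9973. The sample size of 20,000, the seed, and the helper names `sums_of_uniforms` and `fraction_within` are assumptions of this sketch, not part of the original.

```python
import math
import random


def sums_of_uniforms(n_sums, n_terms=9, seed=1):
    """Return n_sums values, each the sum of n_terms uniform(-1, 1) draws."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    return [sum(rng.uniform(-1.0, 1.0) for _ in range(n_terms))
            for _ in range(n_sums)]


values = sums_of_uniforms(20000)
n = len(values)
mean = sum(values) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)

# A uniform variable on (a, b) has variance (b - a)^2 / 12, here 1/3,
# so the sum of nine independent draws has variance 3.
theoretical_sd = math.sqrt(9 * (2.0 ** 2) / 12.0)


def fraction_within(k):
    """Fraction of simulated sums within k sample standard deviations of the mean."""
    return sum(1 for v in values if abs(v - mean) <= k * sd) / n


print("sample mean:", round(mean, 3))
print("sample sd:", round(sd, 3), "theoretical sd:", round(theoretical_sd, 3))
for k in (1, 2, 3):
    print("within", k, "sd:", round(fraction_within(k), 4))
```

Even with only nine terms per sum, the simulated fractions should fall close to the normal values, which is the point made in the text about the number of "chaotic elements" not needing to be large.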
These differences result from the combined effects of several or many genes as well as numerous environmental effects acting together as "a large sample of chaotic elements," which often produce, in the aggregate, a normal distribution of phenotypes.

It should be emphasized that the "large number" of random elements specified in the central limit theorem need not be excessive. As an example, Figure 2.2 is a bar graph of 100 observations in which each "observation" consists of the sum of nine consecutive random numbers chosen with equal probability from anywhere in the range (-1, +1). For the sum of nine random numbers in this range, the theoretical mean equals 0 and the theoretical standard deviation equals 1.73; the sample values were $\bar{x} = -0.12$ and $s = 1.70$. Expressed as a deviation from the mean in multiples of the standard error, the number of observations in each category is shown at the top of the bar in Figure 2.2. Because the expected numbers are 2.5, 13.5, 68, 13.5, and 2.5, the fit to a normal distribution is obviously very good. In this example, therefore, fewer than 10 "chaotic elements," when added together, yield "this unexpected and most beautiful form of regularity."

Figure 2.2 Distribution of 100 values of the sum of nine random numbers from the interval (-1, +1). Horizontal axis: deviation from mean (±SE).

PROBLEM 2.1 At an International Health Exhibition in London in 1884, Galton set up an "anthropometric laboratory" that carried out tens of thousands of measurements covering a wide range of human traits. Among the traits was "strength of pull," expressed as the number of pounds that a person could pull with one arm against a resisting force in a sort of arm-wrestling contraption (Galton 1889).
The data for 519 males aged 23-26 years fell into the following categories (the number in parentheses is the number of males in each category): 40-50 lbs (10), 50-60 (42), 60-70 (140), 70-80 (168), 80-90 (113), 90-100 (22), 100-110 (24). Using the midpoint of each category as the strength of pull for all males in that category, estimate the mean and standard deviation of strength of pull. Assuming that strength of pull has a normal distribution with parameters equal to these estimates, what is the expected proportion of males whose strength of pull exceeds 112 pounds?

ANSWER The values of $x_i$ are 45, 55, 65, and so forth. Then $\sum f_i = 519$, $\sum f_i x_i = 38{,}675$, and $\sum f_i x_i^2 = 2{,}963{,}375$. Hence, $\bar{x} = 74.5$ lbs, $s^2 = 156.8$ lbs$^2$, and so $s = 12.5$ lbs. (Answers may differ slightly because of round-off error.) A strength of pull of 112 lbs is three standard errors above the mean; hence a proportion of only $(1 - 0.997)/2 = 0.0015$ (about one in 667) males is expected to have a phenotype exceeding this value.

Discrete Mendelian Variation

Discrete Mendelian variation (also called simple Mendelian variation) refers to phenotypic differences resulting from segregation of the alleles of a single gene. Environmental effects on the trait are small enough, relative to hereditary differences, that the transmission of alleles determining the trait can be traced through pedigrees. An example of discrete Mendelian variation is the inheritance of red, pink, or white flower color in snapdragons (Chapter 1). This case is exceptionally convenient for genetic studies because of the intermediate phenotype of the heterozygote. However, most of the phenotypic variation in natural populations is multifactorial. In human beings, for example, although simple Mendelian variation accounts for many inherited disorders, each of the disorders is relatively rare.

Ironically, simple Mendelian variation is more easily detected by studying genes and their products than by studying phenotypes.
Because the mechanisms of transcription, RNA processing, and translation are relatively free of gene interactions and environmental effects that complicate the analysis of multifactorial traits at the phenotypic level, there is a direct connection between DNA sequences and alleles and a nearly direct connection between genes and their products. Indeed, the correspondence between DNA sequences and alleles is one-to-one: different alleles have different DNA sequences irrespective of whether the alleles affect phenotype. Likewise, alleles with nonsynonymous codon differences in a protein-coding region result in different amino acid sequences irrespective of what the polypeptide does in metabolism or how the difference in sequence affects the organism.

Hence, an efficient way to detect simple Mendelian variation is to study molecules, and therein lies a paradox. As evolutionary biologists, population geneticists are interested in observable phenotypes that are likely to be subject to natural selection: morphology, rate of development, mating behavior, age of reproduction, longevity, and so forth (in short, the types of traits that attracted Galton). On the other hand, genetic studies are most readily carried out with simple Mendelian variation detected as differences between molecules. The paradox is that differences in molecules among healthy organisms are not usually related in any obvious way to differences in phenotype. Thus, there is a gap in being unable to specify exactly which types of molecular differences underlie the evolutionary process. The irony of the situation is similar to that described by the physiologist Albert Szent-Györgyi:

My own scientific life was a descent from higher to lower dimensions, led by the desire to understand life. I went from animals to cells, from cells to bacteria, from bacteria to molecules, from molecules to electrons.
The story had its irony, for molecules and electrons have no life at all. On my way, life ran out between my fingers.

The gap between genotype and phenotype results from the complex interactions between genes and environment in the determination of physiology, development, and behavior. In evolutionary biology, the complexity is even greater because the key issue is the relative ability of organisms to survive and reproduce in their environments. Nevertheless, the disconnect between differences in molecules and evolutionary adaptations is by no means inevitable, permanent, or insurmountable. It is already clear that the study of the relation between genetic variation and evolutionary adaptation must be high on the agenda of evolutionary biology for the next century, and already there are many examples in which the relation is quite well established.

EXPERIMENTAL METHODS FOR DETECTING GENETIC VARIATION

For nearly 50 years, the workhorse method for revealing genetic variation has been electrophoresis because small differences in rate of migration in an electrophoretic field can be used to distinguish between nearly identical macromolecules. A typical laboratory setup for electrophoresis is illustrated schematically in Figure 2.3.

Figure 2.3 One type of laboratory apparatus for electrophoresis. The procedure is widely used to separate protein or DNA molecules. In conventional gels, DNA fragments smaller than about 20 kb migrate approximately in proportion to the logarithm of their molecular weights. (Labels in the figure: bands, visible after suitable treatment; power supply; electrode.)
The tray contains a thick layer of a gel, typically starch, acrylamide, or agarose; it may be placed horizontally (as shown in the illustration) or vertically (with the gel sandwiched between two glass plates). Each sample of material is placed in a small slot near the edge of the gel. Connected to each edge of the gel is a chamber containing a buffered solution and electrodes. In electrophoresis, an electric current is applied across the gel for several hours. Molecules in the samples (usually proteins or nucleic acids are of greatest interest) move through the gel in response to the electric field. Molecules of different size and charge move at different rates. After the electrophoresis is finished, the positions of the molecule or molecules of interest are revealed by any of several procedures.

Protein Electrophoresis

In protein electrophoresis, used primarily to study enzyme molecules, the position to which a particular enzyme migrates is revealed by soaking the gel in a solution containing a substrate for the enzyme along with a dye that precipitates where the enzyme-catalyzed reaction takes place. A dark band thus appears in the gel at the position of the enzyme. If the enzyme present in a sample has an amino acid replacement that results in a difference in the overall ionic charge of the molecule, then the enzyme will have a somewhat altered electrophoretic mobility and move at a different rate. The electrophoretic mobility changes because enzymes of the same size and shape move at a rate determined largely by the ratio of the number of positively charged amino acids (primarily lysine, arginine, and histidine) to the number of negatively charged ones (principally aspartic acid and glutamic acid). Electrophoresis can therefore be used to detect a mutation that results in a difference in electrophoretic mobility of the enzyme it encodes.
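The charge argument above can be made concrete with a crude count of charged residues. The sketch below is only an illustration under stated assumptions: the two peptide sequences are hypothetical, the function name `charge_counts` is introduced here, and the count ignores pH (histidine, for instance, is only partly charged near neutral pH), so it is a rough proxy for electrophoretic mobility rather than a prediction of it.

```python
# One-letter amino acid codes for the residues named in the text.
POSITIVE = set("KRH")  # lysine, arginine, histidine
NEGATIVE = set("DE")   # aspartic acid, glutamic acid


def charge_counts(sequence):
    """Return (positive, negative, net) counts of charged residues."""
    positive = sum(1 for aa in sequence if aa in POSITIVE)
    negative = sum(1 for aa in sequence if aa in NEGATIVE)
    return positive, negative, positive - negative


# Hypothetical peptides differing by one replacement (lysine -> glutamic acid).
peptide_1 = "MKTAYDEKLR"
peptide_2 = "MKTAYDEELR"

counts_1 = charge_counts(peptide_1)
counts_2 = charge_counts(peptide_2)
print(counts_1)  # (3, 2, 1)
print(counts_2)  # (2, 3, -1)
```

A single replacement of a positively charged residue by a negatively charged one shifts the net count by two, which is the kind of difference that protein electrophoresis can detect; replacements that leave the charge unchanged would be invisible to this count.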
One possible result of an electrophoresis experiment is shown in the hypothetical gel in Figure 2.4A, in which all samples manifest an enzyme with the same electrophoretic mobility. The result indicates a monomorphic sample because there is only one electrophoretic pattern observed. Another kind of result is shown in Figure 2.4B, in which polymorphism is observed in the types of electrophoretic patterns. When polymorphic enzyme bands are observed, genetic tests typically indicate that organisms with only a fast-migrating enzyme are homozygous for a fast allele (F/F) and those with only a slow-migrating enzyme are homozygous for a slow allele (S/S). Organisms with both enzyme bands are heterozygous for the alleles (F/S). Simple Mendelian inheritance of the polymorphism is indicated by, for example, the finding that matings of two heterozygotes produce, on the average, 1/4 F/F, 1/2 F/S, and 1/4 S/S progeny. Two enzyme bands appear in heterozygotes whenever the active enzyme consists of a single polypeptide chain (rather than two or more polypeptide chains aggregated together) because heterozygotes produce a different polypeptide chain from each allele.

Enzymes that differ in electrophoretic mobility as a result of allelic differences in a single gene are called allozymes. Hence, allozyme variation in a population is an indication of simple Mendelian genetic variation. Allozyme variation is widespread in almost all natural populations studied by electrophoresis, including organisms such as bacteria, plants, Drosophila, mice, and human beings.

Figure 2.4 Monomorphism and polymorphism. (A) Hypothetical gel showing protein monomorphism. All samples have an enzyme with the same electrophoretic mobility. (B) Hypothetical gel showing allozyme polymorphism. Eight samples are homozygous for an allele (F) that codes for a rapidly migrating enzyme; two samples are homozygous for a different allele (S) that codes for a slowly migrating enzyme, and six samples are heterozygous (F/S) and therefore exhibit enzyme bands corresponding to both alleles.

PROBLEM 2.2 A sample of 35 organisms from a Texas population of the wild annual plant Phlox drummondii were examined for the electrophoretic mobility of the enzyme alcohol dehydrogenase (Levin 1978). Two alleles affecting electrophoretic mobility were found: $Adh^a$ and $Adh^b$. The genotype frequencies observed in the sample were 0.04 $Adh^a/Adh^a$, 0.32 $Adh^a/Adh^b$, and 0.64 $Adh^b/Adh^b$. Estimate the allele frequency of $Adh^a$ and its standard error.

ANSWER Let $p$ represent the allele frequency of $Adh^a$. Then $p = 0.04 + 0.32/2 = 0.20$. The standard error equals $\sqrt{(0.20)(1 - 0.20)/(2 \times 35)} \approx 0.05$.

PROBLEM 2.3 From a natural population of Drosophila melanogaster in Raleigh, North Carolina, 660 fertilized females were trapped and used to found a large laboratory population (Mukai et al. 1974). After about five months (10 generations), 489 third chromosomes in the population were examined for allozymes of the enzymes esterase-6 (alleles $E6^F$ and $E6^S$), esterase-C (alleles $EC^F$ and $EC^S$), and octanol dehydrogenase (alleles $Odh^F$ and $Odh^S$). The order of the genes in the third chromosome is known to be E6-EC-Odh. The results were as follows:

| Chromosome type | Number |
|---|---|
| $E6^F\ EC^F\ Odh^F$ | 152 |
| $E6^S\ EC^F\ Odh^F$ | 264 |
| $E6^F\ EC^F\ Odh^S$ | 7 |
| $E6^F\ EC^S\ Odh^F$ | 15 |
| $E6^S\ EC^F\ Odh^S$ | 13 |
| $E6^S\ EC^S\ Odh^F$ | 29 |
| $E6^F\ EC^S\ Odh^S$ | 1 |
| $E6^S\ EC^S\ Odh^S$ | 8 |

Estimate the allele frequencies and their standard errors for $E6^F$ and $E6^S$, for $EC^F$ and $EC^S$, and for $Odh^F$ and $Odh^S$.
What number of each of the chromosome types is expected assuming that the alleles are associated at random?

ANSWER For esterase-6, there were 175 $E6^F$ and 314 $E6^S$ alleles, yielding $p = 175/489 = 0.358$ for $E6^F$ and $q = 314/489 = 0.642$ for $E6^S$; the standard error is the same for both estimates, and equals $\sqrt{(0.358)(0.642)/489} = 0.022$. For the other alleles, the estimates and their standard errors are 0.892 ± 0.014 for $EC^F$ and 0.108 ± 0.014 for $EC^S$; and 0.941 ± 0.011 for $Odh^F$ and 0.059 ± 0.011 for $Odh^S$. With random combinations, the expected number of each chromosome type equals the product of the allele frequencies times the sample size. For example, for $E6^F\ EC^F\ Odh^F$, the expected number is $(0.358)(0.892)(0.941) \times 489 = 146.8$. The expected numbers (observed in parentheses) of the eight chromosome types are:

| Chromosome type | Expected (observed) |
|---|---|
| $E6^F\ EC^F\ Odh^F$ | 146.8 (152) |
| $E6^S\ EC^F\ Odh^F$ | 263.4 (264) |
| $E6^F\ EC^F\ Odh^S$ | 9.3 (7) |
| $E6^F\ EC^S\ Odh^F$ | 17.8 (15) |
| $E6^S\ EC^F\ Odh^S$ | 16.6 (13) |
| $E6^S\ EC^S\ Odh^F$ | 32.0 (29) |
| $E6^F\ EC^S\ Odh^S$ | 1.1 (1) |
| $E6^S\ EC^S\ Odh^S$ | 2.0 (8) |

The model of random association of alleles fits very well.

The Southern Blot Procedure

Like polypeptides, DNA fragments can be separated by electrophoresis. Unlike a polypeptide, which has a predetermined size according to the number of amino acids it contains, a molecule of chromosomal DNA is randomly sheared into fragments of various size during purification. Therefore, in any DNA preparation, the DNA fragments containing a particular sequence have a range of sizes depending on where on each side of the sequence the chromosomal DNA became sheared. Fortunately, there is a class of enzymes that cleaves DNA at particular sites along the molecule. Consequently, when chromosomal DNA is cleaved with such an enzyme, each DNA fragment containing a particular sequence is cut at the same sites on either side and so will have the same length.

The enzymes that cleave DNA at particular sites are called restriction enzymes. Each type of restriction enzyme cuts double-stranded DNA at all sites at which there is a particular nucleotide sequence called the restriction site of the enzyme.
Examples of restriction enzymes and their restriction sites are shown in Figure 2.5; the cuts are made at the positions of the arrows. For example, the enzyme AluI cuts at sites of the four-nucleotide sequence AGCT, and EcoRI cuts at the six-nucleotide sequence GAATTC. Most restriction enzymes used in population studies have either four-nucleotide or six-nucleotide restriction sites.

| Restriction enzyme | Restriction site |
|---|---|
| AluI | 5'-AGCT-3' / 3'-TCGA-5' |
| HhaI | 5'-GCGC-3' / 3'-CGCG-5' |
| HaeIII | 5'-GGCC-3' / 3'-CCGG-5' |
| EcoRI | 5'-GAATTC-3' / 3'-CTTAAG-5' |
| BamHI | 5'-GGATCC-3' / 3'-CCTAGG-5' |
| XhoI | 5'-CTCGAG-3' / 3'-GAGCTC-5' |

Figure 2.5 Restriction enzymes cleave DNA molecules at sites of specific, short nucleotide sequences. More than 500 different restriction enzymes are commercially available. They are essential tools in DNA analysis and gene cloning. The cleavage site in each DNA strand is indicated by the arrow (arrows are not reproduced in this copy).

DNA is also unlike an enzyme in that it lacks any catalytic activity that can be used to determine the location of a band in a gel. On the other hand, any single strand of DNA is able to form a double-stranded molecule by pairing with another strand having the complementary base sequence. This pairing of complementary DNA strands is the physical basis of the most widely used procedure for identifying DNA fragments in a gel; the procedure, illustrated in Figure 2.6, is a Southern blot. The reagent used for identification is a molecule of DNA called the probe, which contains the nucleotide sequence of interest. Probe DNA is usually obtained from a gene that has been cloned (for example, into a bacterial cell) or by amplification with the polymerase chain reaction (described in the next section).
In the Southern procedure, DNA restriction fragments that have been separated by electrophoresis are rendered single-stranded by soaking in a solution of sodium hydroxide, then blotted onto a nitrocellulose or nylon filter where subsequent chemical treatment attaches them (Figure 2.6A). The filter is then bathed in a solution containing probe DNA that has been rendered radioactive (part B). As the solution cools, the probe DNA strands form double-stranded molecules with their complementary counterparts on the filter, and careful washing removes all of the probe DNA that has remained unpaired. The filter is sandwiched with photographic film, where radioactive disintegrations from the bound probe result in visible bands (part C). Alternatively, the probe may be chemically modified and the bands visualized by fluorescence or staining.

Figure 2.6 Southern blot procedure. (A) DNA fragments separated by electrophoresis are transferred and chemically attached to a filter. (B) The filter is mixed with radioactive probe DNA, which sticks to homologous DNA molecules in the filter. (Dark bands are not visible at this stage.) (C) After washing, the filter is exposed to photographic film, which develops dark bands caused by radioactive emissions from the probe.

Figure 2.7 Restriction fragment length polymorphisms (RFLPs) result from the presence or absence of particular restriction sites in DNA. In this example, the DNA molecule designated A contains three restriction sites, and the one designated a contains four. Genotypes AA, Aa, and aa each yield a different pattern of bands in a Southern blot using the indicated probe DNA.
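The way a restriction site determines fragment length can be sketched in a few lines. The example below is a simplified illustration, not the procedure itself: the two DNA sequences are hypothetical, only the top strand of a linear molecule is modeled, the function name `restriction_fragments` is introduced here, and the probe step (which determines which fragments are actually visible on the film) is not modeled. The cut offset of 1 reflects the fact that EcoRI cuts between the G and the first A of GAATTC.

```python
def restriction_fragments(sequence, site, cut_offset):
    """Return top-strand fragment lengths after cutting a linear molecule.

    The molecule is cut at every occurrence of `site`; `cut_offset` is the
    number of bases from the start of the site to the cut position.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


ECORI_SITE = "GAATTC"

# Hypothetical allele with two EcoRI sites; the middle "GAGTTC" is one
# nucleotide away from being a third site.
allele_A = ("ACGT" * 5 + ECORI_SITE + "TTGCA" * 3 + "GAGTTC"
            + "TTGCA" * 3 + ECORI_SITE + "CCGTA" * 4)

# Hypothetical allele in which a single G -> A substitution creates the site.
allele_a = allele_A.replace("GAGTTC", ECORI_SITE)

print(restriction_fragments(allele_A, ECORI_SITE, 1))  # [21, 42, 25]
print(restriction_fragments(allele_a, ECORI_SITE, 1))  # [21, 21, 21, 25]
```

In this sketch a single nucleotide difference splits the 42-nucleotide fragment into two 21-nucleotide fragments, which is the kind of length difference that a Southern blot reveals when the probe covers that region.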
Genetic differences resulting in the presence or absence of restriction sites can be identified because they change the length of characteristic restriction fragments. An example is illustrated in Figure 2.7. The upper part of each panel shows the location of restriction sites in the DNA molecules in a diploid genotype. The a-type molecule contains one additional restriction site not present in the A-type molecule. The lower part of the figure demonstrates that, with suitable probe DNA, all three genotypes can be distinguished by their pattern of restriction fragments. A difference in the length of a restriction fragment found segregating in natural populations is called a restriction fragment length polymorphism or RFLP. Because RFLPs are widely distributed throughout the genome of human beings and other organisms, they have assumed major importance in population genetics.

The Polymerase Chain Reaction

The polymerase chain reaction (PCR) for the amplification of specific DNA sequences is of great utility in population genetics for the production of probe DNA or for the direct determination of the amount of nucleotide sequence variation present in natural populations. The method is outlined in Figure 2.8. The original DNA sequence to be amplified is shown in black and the newly synthesized DNA strands in gray. The small ovals represent synthetic oligonucleotides that are complementary in sequence to the ends of the region to be amplified. The oligonucleotides are called primer sequences because they anneal to the ends of the sequence to be amplified and are used as primers for chain elongation by DNA polymerase. Primer oligonucleotides are typically 18-22 nucleotides in length. DNA to be used as the template in a PCR reaction is first mixed with both primers along with a thermostable DNA polymerase in a buffer solution. The PCR amplification takes place in cycles.
In the first cycle, the DNA is heated to separate the strands and then cooled in the presence of a vast excess of the primer oligonucleotides. Then elongation of the primers produces double-stranded molecules. The second cycle of PCR is similar to the first but, after the second cycle, there are four copies of each original molecule. The cycle is repeated from 20 to 30 times, each resulting in a doubling of the number of molecules. The theoretical result of n rounds of amplification is $2^n$ copies of each template molecule originally present.

Figure 2.8 The polymerase chain reaction (PCR). Short primer oligonucleotides are used as primers to initiate DNA replication from opposite ends of a DNA duplex to be amplified. After each round of replication, the DNA is heated to separate the strands and then cooled to allow new primers to anneal. Repeated rounds of replication result in an exponential increase in the number of target molecules.

PCR amplification is very useful in generating large quantities of a specific DNA sequence without the need for cloning. The main limitation of the technique is that the DNA sequences at the ends of the region to be amplified must be known so that primer oligonucleotides can be synthesized. There are many applications in which this requirement is met. In population genetics, for example, PCR can be used to amplify different alleles present in natural populations.

PROBLEM 2.4 PCR was used to amplify five alleles (designated f-j) of the gene Rh3 coding for a light-sensitive protein in the eye of Drosophila simulans, a species of fruit fly closely related to D. melanogaster. The resulting DNA fragments were sequenced (Ayala et al. 1993).
The data show the nucleotide present at each of 16 polymorphic nucleotide sites found in the first 500 nucleotide sites in the amino acid coding region of the gene; the remaining 484 nucleotide sites were monomorphic in this sample. Any nucleotide site that is an exact multiple of three is at the third position of a codon. In this region of the gene:
(a) what proportion of polymorphic nucleotide sites are in third positions of codons? What can you infer from this observation?
(b) what proportion of nucleotide sites are polymorphic?
(c) why is the standard error formula not appropriate for the estimate in part (b)?

Nucleotide site in gene (? marks an entry that is illegible or doubtful in this copy)

Allele  132  142  162  192  198  201  207  240  246  351  354  372  375  405  417  483
f        T    C    T    A    C    C    T    C    C    T    C    G    G    T    T    A
g        T    C    C    T    A    C    C    T    C    C    T    G    G    T    T    T
h        C    T    C    C    C    C    C    T    C    T    T    T    G    C    T    A
i        C    T    C    C    C    G?   C    T    T    C    T    G    A    C    T    T
j        C    T    C    C    C    T?   C    T    T    ?    T    G    G    C    C    A

ANSWER (a) Among the 16 polymorphic sites, only site 142 is not an exact multiple of three; hence 15/16 = 94% of the polymorphic sites are in the third codon position. The inference is that many of the nucleotide polymorphisms are silent (synonymous) in that they do not alter the amino acid sequence of the polypeptide. (In fact, all 16 polymorphisms are silent, including the C -> T change in 142, which changes the codon from CUA -> UUA, both of which code for leucine.) (b) A total of 16/500 = 3.2% of the nucleotide sites are polymorphic in this region of the gene. (c) The binomial standard error is not appropriate because the nucleotides within a gene are not independent samples; they are genetically closely linked. (A suitable estimate of the standard error is given later in this chapter.)

POLYMORPHISM AND HETEROZYGOSITY

Monomorphism or polymorphism of a gene in a sample is usually of interest only insofar as it indicates monomorphism or polymorphism of the gene in the population as a whole.
In a population, a polymorphic gene is one for which the most common allele has a frequency of less than 0.95 (some authors prefer a more stringent cutoff at 0.99). Conversely, a monomorphic gene is one that is not polymorphic. The cutoff at 0.95 (sometimes 0.99) in the definition of polymorphism is arbitrary, but it serves to focus attention on those genes in which allelic variation is common.

In any large population, rare alleles are observed for virtually every gene. An allele is considered a rare allele if its frequency is less than 0.005; in human beings, between one and two people per thousand are heterozygous for rare alleles of any gene. Many rare alleles are deleterious and are presumably maintained in the population by recurrent mutation. The definition of polymorphism is an attempt to focus on genes that have alleles with frequencies too high to be explained solely by recurrent mutation to harmful alleles. With the 0.95 definition of polymorphism given above, and if alleles are combined at random into genotypes, then at least 9.5% of the population is heterozygous for the most common allele (because 2 × 0.95 × 0.05 = 0.095).

Allozyme Polymorphisms

Polymorphism of alleles that determine allozymes is extremely widespread. Figure 2.9 summarizes the results of electrophoretic surveys of 14 to 71 (mostly around 20) genes in populations of 243 species. Each point in the figure (except that for human beings) gives the type of organism studied and the number of species examined. The axis labeled Polymorphism refers to the estimated proportion of genes that are polymorphic by the 0.95 criterion. The axis labeled Heterozygosity refers to the average heterozygosity in each group. The average heterozygosity is the estimated proportion of genes expected to be heterozygous in an average organism; it is estimated as the proportion of heterozygous genotypes for each gene averaged over all genes.
For example, the data for Europeans include an English population in which 10 enzyme genes were examined (Harris 1966). Of the 10 genes, three were found to be polymorphic, from which the estimated proportion of polymorphic genes in the genome is 3/10 = 0.3. The observed proportion of heterozygous genotypes for each of the three polymorphic genes was 0.509 (for red-cell acid phosphatase), 0.385 (for phosphoglucomutase), and 0.095 (for adenylate kinase); the average heterozygosity in this sample, taking into account the additional seven genes for which the observed heterozygosity was 0, is therefore (0.509 + 0.385 + 0.095 + 7 × 0)/10 = 0.099.

The vertical and horizontal bars on the point corresponding to Drosophila indicate the size of the standard error of the estimate. Therefore, the bars indicate the limits of polymorphism and heterozygosity within which about 68% of the species are expected to fall. Among Drosophila species, approximately 68% have a proportion of polymorphic genes in the range 0.30-0.56 and an average heterozygosity in the range 0.09-0.19. Such bars could be attached to each point; their lengths would be comparable to those for Drosophila, indicating substantial variability in polymorphism and heterozygosity among species within groups.

Figure 2.9 Estimated levels of heterozygosity (horizontal axis) and proportion of polymorphic genes (vertical axis) derived from allozyme studies of various groups of plants and animals. The number of species studied is shown in parentheses beside each point. Squares denote averages for plants, invertebrates, and vertebrates. The bars across the Drosophila point indicate the standard error within which about 68% of the species are expected to fall. Other groups have similarly large standard errors. (Data from Nevo 1978.) Points plotted: humans (Europeans, 71 loci), insects excluding Drosophila (23), Drosophila (43), reptiles (17), birds (7), all vertebrates (135), mammals (46), all invertebrates (93), invertebrates excluding insects (27), amphibians (13), plants (15).

Figure 2.9 has no simple summary because of the immense variability in polymorphism and heterozygosity found within each group of organisms (as indicated by the length of the variability bars corresponding to Drosophila). On the whole, there is a positive relationship between amount of polymorphism and degree of heterozygosity. This relationship is as expected because the greater the fraction of polymorphic genes in a population, the more genes that are expected to be heterozygous on the average. The overall mean polymorphism in Figure 2.9 is 0.26 ± 0.15, and the mean heterozygosity is 0.07 ± 0.05. Vertebrates have the lowest average amount of genetic variation among the groups in Figure 2.9, plants come next, and invertebrates have the highest. Drosophila is the most genetically variable group of higher organisms so far studied, and mammals the least variable. Human beings are fairly typical of large mammals: an extensive electrophoretic survey of 104 genes in a sample including all major human races gave estimates of polymorphism of 0.32 and heterozygosity of 0.06 (Harris et al. 1977). The one obvious conclusion that can be reached from Figure 2.9 is that allozyme polymorphisms are widespread among higher organisms. Genetic variation is even more prevalent among some prokaryotes. For example, natural isolates of the mammalian intestinal bacterium Escherichia coli exhibit levels of genetic polymorphism two or three times greater than those of vertebrates (Selander et al. 1987).

Although genetic polymorphisms are widespread, they are not universal. For example, both major subspecies of the cheetah Acinonyx jubatus are virtually monomorphic (O'Brien et al. 1987).
A survey of 49 enzymes among 30 animals from the East African subspecies (A. j. raineyi) yielded only two polymorphic genes and estimates of polymorphism of 0.04 and heterozygosity of 0.01; among 98 animals from the South African subspecies (A. j. jubatus), the estimate of polymorphism was 0.02 and that of heterozygosity 0.0004. Most unusual was the finding of skin-graft acceptance between unrelated cheetahs from the South African subspecies. Graft acceptance means that the cheetah population is monomorphic for the major histocompatibility locus, which is abundantly polymorphic in other mammals. Apparently, the cheetah, which was worldwide in its range at one time but presently numbers fewer than 20,000 animals, underwent at least two severe constrictions in population number resulting in the loss of most of its genetic variability.

How Representative Are Allozymes?

The generality of estimates of polymorphism based on electrophoresis is somewhat uncertain. The amount of polymorphism may be underestimated because conventional electrophoresis fails to detect many amino acid replacements. For example, in a study of 14 myoglobin proteins from various species including cetaceans (whales, dolphins, and porpoises), no more than eight could be distinguished by conventional electrophoresis; however, 13 could be distinguished by varying the pH value of the electrophoresis buffer (McLellan and Inouye 1986). Some amino acid replacements can be detected because they render the enzyme sensitive to high temperatures; a test for temperature sensitivity increased the number of identified alleles of the gene coding for xanthine dehydrogenase in Drosophila pseudoobscura from 6 to 37 and increased the estimate of average heterozygosity from 0.44 to 0.73 (Singh et al. 1976).
On the other hand, although more elaborate techniques reveal additional alleles of genes known to be polymorphic, thus increasing estimates of heterozygosity, genes classified as monomorphic by means of routine electrophoresis tend to remain monomorphic, and so estimates of polymorphism remain much the same as before.

Electrophoretic surveys might also overestimate the amount of polymorphism because the enzymes typically surveyed are those found in relatively high concentration in tissues or body fluids ("Group I enzymes") and often lack the high substrate specificity of enzymes implicated in central metabolic processes ("Group II enzymes"). For example, among 10 Group I and 11 Group II enzymes in Drosophila, estimates of polymorphism and heterozygosity were 0.70 and 0.24 in the former and 0.27 and 0.04 in the latter (Gillespie and Langley 1974). In summary, protein electrophoresis is a convenient method for detecting polymorphisms, but it is difficult to extrapolate from electrophoretic surveys of enzymes to the entire genome because the enzymes may not be representative.

Polymorphisms in DNA Sequences

One inevitable limitation of protein electrophoresis is the inability to detect variation in a nucleotide sequence that does not alter the amino acid sequence. A polymorphism is silent if it is present in the coding region but does not alter the amino acid sequence; many nucleotide differences in third-codon position are of this type. A polymorphism is noncoding if it affects nucleotides in noncoding regions such as the upstream region, the downstream region, or introns. Silent and noncoding polymorphisms may have subtle effects on the organism, and the alleles may be affected by natural selection; the polymorphic alleles are silent or noncoding only in the sense that they all code for the same amino acid sequence.
An example of extensive silent polymorphism in Drosophila is illustrated in Figure 2.10 for alleles of the gene coding for alcohol dehydrogenase. This gene has an electrophoretic polymorphism that is widespread in natural populations with two predominant alleles, slow (Adh-S) and fast (Adh-F). The molecular difference is that, in the fourth and last exon of the gene, the codon for amino acid number 193 in Adh-S is AAG (lysine) and in Adh-F is ACG (threonine). The enzymes differ not only in electrophoretic mobility. The product of the fast allele has a greater enzymatic activity and is also synthesized in greater amount than that of the slow allele.

The data in Figure 2.10 are derived from studies of RFLPs in the Adh region of 1533 flies isolated from 25 populations throughout eastern North America (Berry and Kreitman 1993). A total of 113 haplotypes were identified. A haplotype is a unique combination of genetic markers present in a chromosome. In Figure 2.10, the haplotypes indicated with squares are Adh-F and those with circles are Adh-S. The number inside each symbol is the relative abundance of the haplotype (1 being the most frequent, 2 the next most frequent, and so forth). A straight line connecting two haplotypes indicates that they differ by a single change. Figure 2.10 includes 93 haplotypes related to at least one other by a single change; the other 20 haplotypes observed in the study include additional changes. The main point of the Adh example is that natural populations contain a great abundance of different types of nucleotide-sequence variation that does not affect amino acid sequence.

Nucleotide Polymorphism and Nucleotide Diversity

Sequence data can be used quantitatively to estimate the level of genetic variation at the nucleotide level. The data in Problem 2.4 are typical and so will be used to exemplify the calculations.
Figure 2.10 Haplotypes of alleles in the Adh region of Drosophila melanogaster from the East Coast of North America. Each line in the network connects two haplotypes differing by a single molecular difference. An additional 20 haplotypes, differing by more than one change from those in the network, are not shown. Squares indicate the Adh-F allele, circles the Adh-S allele. (From Berry and Kreitman 1993.)

The level of nucleotide polymorphism, symbolized $\theta$, is the proportion of nucleotide sites that are expected to be polymorphic in any sample of size 5 from this region of the genome. The estimate $\hat{\theta}$ equals the proportion of nucleotide polymorphism observed in the sample, often symbolized as $S$, divided by

$$a_1 = \sum_{i=1}^{n-1} \frac{1}{i} \qquad (2.4)$$

where $n$ is the size of the sample. In this case, $S = 16/500 = 0.032$ for a sample of size $n = 5$, so that $a_1 = 1/1 + 1/2 + 1/3 + 1/4 = 2.083$. The estimate of $\theta$, per nucleotide site, is therefore

$$\hat{\theta} = \frac{S}{a_1} = \frac{0.032}{2.083} = 0.015 \qquad (2.5)$$

As noted in Problem 2.4, the variance of $\hat{\theta}$ is not binomial because, owing to genetic linkage, successive nucleotides cannot be regarded as realizations of independent trials. An approximation to the variance can be derived under the assumption that the nucleotides at a site are functionally equivalent or invisible to natural selection; the mathematical details are beyond the scope of this book, but the result is quite simple. The variance of $\hat{\theta}$, per nucleotide site, is given by

$$V(\hat{\theta}) = \frac{\theta}{k a_1} + \frac{a_2 \theta^2}{a_1^2} \qquad (2.6)$$

where $a_1$ is as defined in Equation 2.4, $k$ is the number of nucleotides in each sequence (in our example, $k = 500$), and $a_2$ is a function of the number of alleles $n$ in the sample, namely

$$a_2 = \sum_{i=1}^{n-1} \frac{1}{i^2} \qquad (2.7)$$

For $n = 2$ through 10, the values of $a_2$ are 1, 1.25, 1.36, 1.42, 1.46, 1.49, 1.51, 1.53, 1.54. In the case at hand, $n = 5$ and the estimated variance of $\hat{\theta}$ = 0.015/(500 × 2.083) + 1.42 × 0.015²/2.083² = 9.2131 × 10⁻⁵.
The standard error of $\hat{\theta}$ is the square root of the variance or, in this case, 0.0096 per nucleotide site.

A second quantity used to assess polymorphisms at the DNA level is the nucleotide diversity, typically denoted $\pi$, which is the average proportion of nucleotide differences between all possible pairs of sequences in the sample. In a sample of $n$ sequences, there are $n(n - 1)/2$ pairwise comparisons. For the data in Problem 2.4, $n = 5$, and so there are 10 pairwise comparisons. The pairwise comparisons may be considered for each nucleotide in turn and the differences averaged later. For the polymorphic sites in Problem 2.4, the number of pairwise differences is 6 (= 2 × 3) for sites 132, 142, 246, 351, 405, and 483; it is 4 (= 1 × 4) for sites 162, 198, 201, 207, 240, 354, 372, 375, and 417; and it is 7 for site 192. Among the 484 monomorphic nucleotides in Problem 2.4, the number of pairwise differences is 0. The average proportion of pairwise differences between the sequences in the sample is the estimate $\hat{\pi}$ of the nucleotide diversity; hence,

$$\hat{\pi} = \frac{6 \times 6 + 4 \times 9 + 1 \times 7 + 0 \times 484}{10 \times 500} = 0.016$$

The variance of $\hat{\pi}$ is estimated as follows:

$$\mathrm{Var}(\hat{\pi}) = \frac{b_1}{k}\,\pi + b_2\,\pi^2$$

where $k$ is again the length of the sequences in nucleotides and where

$$b_1 = \frac{n + 1}{3(n - 1)} \qquad (2.9)$$

$$b_2 = \frac{2(n^2 + n + 3)}{9n(n - 1)} \qquad (2.10)$$

For example, when $n = 5$, then $b_1 = 0.5$ and $b_2 = 0.37$, and so $\mathrm{Var}(\hat{\pi})$ = (0.5/500) × 0.016 + 0.37 × 0.016² = 0.000107; the standard error of $\hat{\pi}$ is the square root, or 0.010.

The estimates of $\theta$ and $\pi$ based on nucleotide sequences are not readily convertible to levels of polymorphism and heterozygosity expected at the protein level. The main reason is that most observed nucleotide polymorphisms are either silent or noncoding and so do not change the amino acid sequence of the polypeptide.
The level of protein polymorphism is determined to a large extent by the degree to which the amino acid sequence is constrained by natural selection against variant sequences (or, in some cases, by natural selection for variant sequences), and constraints at the protein level are not generally predictable from $\theta$ and $\pi$.

On the other hand, there is a theoretical relation between $\theta$ and $\pi$ that is expected under the simplifying assumption that the alleles are invisible to natural selection. The theoretical basis of the relation between $\theta$ and $\pi$ is discussed in connection with the neutral theory of molecular evolution in Chapter 8, but the expected relation is that $\theta = \pi$. For the data in Problem 2.4, for example, $\hat{\theta} = 0.015$; this number is to be compared with $\hat{\pi} = 0.016$, and so the agreement with expectation is quite good. (On the other hand, the sample size is very small.)

Estimates of nucleotide polymorphism and diversity can also be carried out with restriction-site data in the form of restriction fragment length polymorphisms (RFLPs). The simplest way to proceed is to analyze the restriction sites in turn. Each monomorphic restriction site is regarded as identifying six adjacent monomorphic nucleotides (or four monomorphic nucleotides, if the enzyme has a four-base restriction site). Each polymorphic restriction site is regarded as identifying five monomorphic nucleotides and one polymorphic nucleotide (or three monomorphic and one polymorphic, if the enzyme has a four-base restriction site). In other words, each restriction site polymorphism is supposed to result from polymorphism of a single nucleotide in the restriction site. Pairwise comparisons to estimate $\pi$ are carried out under this assumption. The reasoning is illustrated in the following problem.
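The estimators of nucleotide polymorphism and nucleotide diversity described above, together with the counting rule for restriction sites, can be sketched in a few lines of Python. This is only an illustrative sketch, not a procedure given in the text: the function names (`harmonic_sums`, `theta_estimate`, `pi_estimate`, `restriction_sites_to_nucleotides`) are hypothetical, and the example input is the Problem 2.4 sample (16 polymorphic sites among 500 nucleotides in 5 alleles).

```python
import math


def harmonic_sums(n):
    """Return (a1, a2): the sums of 1/i and 1/i**2 for i = 1 .. n-1 (Eqs. 2.4, 2.7)."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    return a1, a2


def theta_estimate(num_polymorphic, k, n):
    """Estimate of nucleotide polymorphism per site and its standard error (Eqs. 2.5, 2.6)."""
    a1, a2 = harmonic_sums(n)
    s = num_polymorphic / k            # observed proportion of polymorphic sites, S
    theta = s / a1
    var = theta / (k * a1) + a2 * theta ** 2 / a1 ** 2
    return theta, math.sqrt(var)


def pi_estimate(mismatches_per_site, k, n):
    """Estimate of nucleotide diversity per site and its standard error (Eqs. 2.9, 2.10).

    mismatches_per_site holds, for each polymorphic site, the number of
    pairwise comparisons that differ there; monomorphic sites contribute 0.
    """
    pairs = n * (n - 1) // 2
    pi = sum(mismatches_per_site) / (pairs * k)
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    var = b1 / k * pi + b2 * pi ** 2
    return pi, math.sqrt(var)


def restriction_sites_to_nucleotides(n_monomorphic, n_polymorphic, site_length=6):
    """Convert restriction-site counts into (total nucleotides, polymorphic nucleotides),
    assuming one altered nucleotide per polymorphic restriction site."""
    total = (n_monomorphic + n_polymorphic) * site_length
    return total, n_polymorphic


# Problem 2.4: n = 5 alleles, k = 500 sites, 16 of them polymorphic.
theta, se_theta = theta_estimate(16, 500, 5)
# Six sites with 6 pairwise differences, nine with 4, and one (site 192) with 7.
mismatches = [6] * 6 + [4] * 9 + [7]
pi, se_pi = pi_estimate(mismatches, 500, 5)

print(f"theta = {theta:.4f} (SE {se_theta:.4f})")   # theta = 0.0154 (SE 0.0096)
print(f"pi    = {pi:.4f} (SE {se_pi:.4f})")         # pi    = 0.0158 (SE 0.0104)
```

For restriction-site data, `restriction_sites_to_nucleotides` would first supply the values of k and the number of polymorphic nucleotides, which are then passed to the same two estimators.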
PROBLEM 2.5 Restriction-site variation was studied around the gene for alcohol dehydrogenase (Adh) in a population of D. melanogaster descended from animals trapped at a Dutch fruit market (Cross and Birley 1986). The region contained a total of 23 sites for five restriction enzymes, each having a six-base restriction site. A total of 16 sites were cut in all flies in the sample. The accompanying table documents the presence (+) or absence (−) of each of the 7 polymorphic sites in a sample of 10 chromosomes. Estimate the proportion of polymorphic nucleotides $\theta$, the nucleotide diversity $\pi$, and the standard error of each. Does the relation $\theta = \pi$ seem to hold for the estimates?

[Table: presence (+) or absence (−) of each of the 7 polymorphic restriction sites in each of the 10 chromosomes; the entries are not recoverable in this copy.]

ANSWER Consider first the nucleotide polymorphisms. The 16 monomorphic sites identify 16 × 6 = 96 monomorphic nucleotides; the 7 polymorphic sites identify 7 × 5 = 35 monomorphic nucleotides and 7 × 1 = 7 polymorphic nucleotides (assuming only 1 nucleotide is altered for each restriction site that is lost). Altogether, there are 138 nucleotides of which 7 are polymorphic. Because $n = 10$, then $a_1 = 2.83$ and $a_2 = 1.54$. The estimate of $\theta$ is therefore $\hat{\theta}$ = (7/138)/2.83 = 0.0179 per nucleotide site and $\mathrm{Var}(\hat{\theta})$ = 0.0179/(138 × 2.83) + (1.54 × 0.0179²/2.83²) = 1.0778 × 10⁻⁴. The standard error of $\hat{\theta}$ is therefore 0.0104 per nucleotide site. For estimating $\pi$, there are 10 × 9/2 = 45 pairwise comparisons, and a restriction site with $i$ "plus" and $(10 - i)$ "minus" means that the polymorphic nucleotide site results in $i \times (10 - i)$ pairwise mismatches. Therefore, the total number of mismatches for each of the restriction sites, from left to right, equals 16, 24, 9, 16, 9, 9, and 21, respectively, totaling 104. In addition, there are 16 × 6 nucleotides (from the monomorphic sites) and 7 × 5 nucleotides (from the polymorphic sites) for which the number of pairwise mismatches equals 0. Therefore, $\hat{\pi}$ = 104/(45 × 23 × 6) = 0.017.
For $n = 10$, $b_1 = 0.407$ and $b_2 = 0.279$, and so $\mathrm{Var}(\hat{\pi})$ = 0.0001277; hence the standard error equals 0.011. In these data, $\hat{\theta} = 0.018$ and $\hat{\pi} = 0.017$, which are in very good agreement. However, the sample size is too small to generalize this conclusion.

Uses of Genetic Polymorphisms

Whether studied through allozymes or nucleotide sequences, natural genetic variation has many uses. Genetic variation provides a set of built-in markers for the genetic study of organisms in their native habitats, including organisms for which domestication or laboratory rearing is unfeasible or for which conventional genetic manipulation is impossible.

Genetic polymorphisms are useful in investigating the genetic relationships among subpopulations in a species. The principle is that alleles are shared among subpopulations because of migration, and therefore similarity in allele frequencies among subpopulations can be used to estimate the rate of migration (Chapter 4). Within subpopulations, alleles are shared because of common ancestry. For example, the Ainu people of Northern Japan have numerous Caucasoid-like features, including their facial features, light skin, and hairy bodies, yet their genetic polymorphisms clearly show them to be more closely related to other Mongoloid groups (Watanabe et al. 1975). Among the most informative alleles, the Ainu people possess the D(Chi) allele of transferrin protein and the Di^a allele of the Diego blood group, both of which are virtually restricted to Mongoloid populations. Conversely, the Ainu people lack several alleles that are polymorphic in Caucasoids.

From a practical point of view, genetic polymorphisms are useful in human populations as genetic markers that may be genetically linked to harmful genes that cause disease. In kinships with a family history of the disease, the genetic markers can be used to determine which members of the kindred are likely to be carriers of the harmful gene.
The markers can also be used in early diagnosis of persons likely to be affected. RFLPs and other types of DNA polymorphisms that are linked to disease genes have also demonstrated their utility as probes for identifying recombinant DNA clones containing the defective genes. The nearby genetic markers enable the defective gene and its function to be identified, thus serving as a first step in the search for effective treatments.

Particularly useful in population genetics are DNA markers with a large number of alleles of moderate frequency. In most organisms, many regions of the genome have multiple alleles consisting of a short sequence of bases repeated in tandem. Multiple alleles result because the number of copies of the repeated sequence may differ from one chromosome to the next. The genotypes are even more variable because each genotype carries two alleles. One of the practical applications of the use of such polymorphisms is in DNA typing, in which the alleles in the DNA from a suspect are matched with those from a crime-scene sample. The examination of a sufficient number of such highly variable regions provides a basis for distinguishing one person from another because no two people (with the exception of identical twins) have the same genotype. Genetic variability of this sort is used in determining paternity as well as in criminal investigations. The experimental methods of DNA typing, and certain relevant issues in population genetics, are discussed in Chapter 4.

DNA typing has also been applied to studies of the natural mating systems of plants and animals because, with the large number and high specificity of DNA types, close relatives can be detected in populations. In behavioral studies, DNA typing can determine whether organisms that perform mutually altruistic acts are genetically related. Polymorphisms of other types can also be informative about mating systems.
For example, the observed frequencies of genotypes can be used to estimate the amount of self-fertilization in populations of monoecious plants or hermaphroditic animals.

From the standpoint of evolutionary biology, sequences of genes and patterns of polymorphism can be used to make inferences about evolutionary history and about the evolutionary process. The sequences of macromolecules contain within themselves a record of their evolutionary history. Organisms with a shared ancestry usually have similar gene sequences. Conversely, similarity in sequence can be regarded as a measure of shared ancestry. As an index of shared ancestry, sequence similarity provides a means of inferring the ancestral relationships among a group of organisms (molecular phylogenetics, discussed in Chapter 8). The rates and patterns of change in sequence within species and between closely related species also contain a record of evolutionary forces at work. Within the past 20 years, population genetics has gone from a data-poor field to a data-rich field, and numerous new methods of data analysis and hypothesis testing have been developed.

MULTIPLE FACTOR INHERITANCE

We have seen that Galton and Mendel chose opposite types of traits for their studies of variation: Galton chose continuous traits, Mendel discrete traits. The choices reflected a deep difference of opinion in the manner in which inheritance should be studied. Galton's approach was empirical, based on the observed similarity between relatives such as parents and offspring. Mendel's approach was theoretical, based on unobserved segregating factors that determined the patterns of inheritance.
Even after the rediscovery of Mendel's paper in 1900, the disciples of Galton (called "biometricians") dismissed its significance, claiming that the postulated Mendelian factors were not only irrelevant for continuous traits but also inadequate to explain the observed correlations between relatives. The Mendelians argued that segregation and independent assortment could explain continuous traits just as well as discrete traits. The acrimonious dispute between the biometricians and the Mendelians continued for nearly 20 years.

The dispute abated substantially with a 1918 paper by the statistician Ronald Aylmer Fisher (1890-1962) entitled "The correlation between relatives on the supposition of Mendelian inheritance." Fisher examined a mathematical model of multifactorial inheritance and deduced the expected correlations between relatives. He showed that the kinds of data available for continuous traits were not only compatible with Mendelian inheritance but were also predicted by it.

The spirit of Fisher's model is shown in Figure 2.11, which illustrates the genetic variation expected among the progeny of a cross between genotypes that are heterozygous for each of three unlinked genes. The alleles of the genes are represented A/a, B/b, and C/c, and the genetic variation resulting from segregation and independent assortment is evident in the various degrees of shading. If we assume a trait in which each uppercase allele adds one unit to the phenotype and in which each lowercase allele is without effect, then the aa bb cc genotype has a phenotype of 0 and the AA BB CC genotype has a phenotype of 6.
Thus there are seven possible phenotypes (0-6) among the progeny. The distribution of phenotypes is shown in the bar graph in Figure 2.12. The smooth curve is the normal distribution approximating the data, which has a mean of 3 and a variance of 1.5. In Figure 2.11, we have assumed that all of the variation in phenotype results from differences in genotype. If there were also random environmental factors affecting the trait, as well as a greater number of genes, then the bars in Figure 2.12 would become less distinct and a normal distribution approximated even better. The result is the central limit theorem at work producing Galton's "supreme law of unreason."

Figure 2.11 Result of segregation of three independent pairs of alleles affecting the same trait. Each allele that is indicated by an uppercase letter is assumed to contribute one unit to the phenotype. The phenotypes range from 0 to 6 and, in the cross between triple heterozygotes, are formed in the proportions 1:6:15:20:15:6:1.

Figure 2.12 Distribution of phenotypes from the cross in Figure 2.11 and the approximating normal distribution. The normal curve has mean 3 and variance 1.5.

Fisher's model was a good deal more complex than that in Figure 2.11, allowing for differences in the effects of alleles, differences in allele frequency, various types of dominance relations, and the effects of random environmental factors. The work was pathbreaking in demonstrating that continuous variation could be explained by multiple interacting Mendelian factors. Fisher's model was complex for its time and the paper a difficult one. It is not clear even now what practical role Fisher's paper may have played in ending the controversy between the biometricians and the Mendelians. Not many people seem to have read it. On the other hand, it is the seminal paper that marked the reconciliation of the theories of Galton and Mendel.
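The 1:6:15:20:15:6:1 proportions, and the mean of 3 and variance of 1.5 quoted for the cross between triple heterozygotes, can be checked by enumerating every pair of gametes. The sketch below is an illustration added for that purpose only; it assumes, as in the simple additive model described above, that the three genes assort independently, that each gamete type is equally likely, and that each uppercase allele adds exactly one unit. The variable names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# A gamete from an Aa Bb Cc parent carries one allele of each gene.
# Encode an uppercase allele as 1 and a lowercase allele as 0.
gametes = list(product([0, 1], repeat=3))      # 8 equally likely gamete types

# Phenotype of an offspring = number of uppercase alleles it receives.
counts = Counter()
for egg in gametes:
    for sperm in gametes:
        counts[sum(egg) + sum(sperm)] += 1

total = sum(counts.values())                   # 64 gamete combinations
ratio = [counts[p] for p in range(7)]          # offspring with phenotypes 0..6

mean = Fraction(sum(p * c for p, c in counts.items()), total)
variance = sum(c * (p - mean) ** 2 for p, c in counts.items()) / total

print(ratio)             # [1, 6, 15, 20, 15, 6, 1]
print(mean, variance)    # 3 3/2
```

Exact fractions are used so that the mean and variance come out as 3 and 3/2 without rounding error.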
SUMMARY

Galton examined the statistical relations between the distributions of phenotypic traits in successive generations. Most of the traits he studied were continuous traits, like height or weight, which are measured on a quantitative scale. Galton was very taken with the observation that the phenotypes of many continuous traits are distributed according to the bell-shaped curve known as the normal distribution. The peak of the normal distribution is determined by the mean and the spread is determined by the variance. Phenotypic variation in natural populations is usually in the form of differences in continuous traits. Most continuous traits are also multifactorial, that is, determined by the combined effects of multiple genetic and environmental factors. The normal distribution is often encountered in practice because of the central limit theorem, which states that the limiting distribution of the sum of a large number of independent random quantities is normal.

Mendel studied discrete variation, such as round versus wrinkled peas, resulting from segregation of the alleles of a single gene. Simple Mendelian variation is the rule for genes and their products. Genetic variation in protein molecules can be identified by such techniques as protein electrophoresis. Proteins differing in electrophoretic mobility that are coded by alternative alleles of the same gene are called allozymes. Allozyme variation is widespread in most organisms. Based on electrophoretic surveys of human populations, about 30% of all enzyme-coding genes are polymorphic (in the sense that the most common allele has a frequency less than 0.95), and about 7% of the loci are heterozygous in an average person. Plants and invertebrates have even higher levels of allozyme variation. Although there is wide variation among species, Drosophila averages about 40% polymorphic loci with an average heterozygosity of 14%.
Genetic variation at the DNA level can be detected with the Southern blot procedure, in which DNA fragments produced by a restriction enzyme are separated by electrophoresis and identified by hybridization with a homologous labeled probe sequence. Polymorphisms in the length of restriction fragments (restriction fragment length polymorphisms) are abundant throughout the genome and have applications in studies of genetic linkage in many organisms. DNA studies are also often carried out with the polymerase chain reaction (PCR), in which multiple cycles of primer annealing, DNA replication, and strand separation are used to exponentially amplify the DNA sequence flanked by the oligonucleotide primers. Amplified DNA may be sequenced, used as a probe, or manipulated in other ways.

Polymorphisms in nucleotide sequence are abundant in natural populations, particularly in noncoding regions and at silent sites in coding regions (especially at third codon positions, in which a nucleotide substitution need not result in an amino acid replacement). For a sample of DNA sequences, the amount of nucleotide polymorphism is the proportion of nucleotide sites occupied by two or more bases (A, T, G, C) in the sample. The nucleotide diversity is the average proportion of nucleotide differences between all sequences in the sample taken in pairwise comparison. The estimates of nucleotide polymorphism and nucleotide diversity are not readily compared with allozyme data because much of the observed sequence variation is either noncoding or silent.

There is often a disconnect between molecular variation and phenotypic variation because differences in phenotype among healthy organisms cannot usually be attributed to differences in specific molecules.
Indeed, there is a sort of disconnect between simple Mendelian inheritance and continuous variation because the segregation of any pair of alleles affecting a continuous trait is obscured by the segregation of other pairs of alleles as well as by the effects of the environment. In the early years after the rediscovery of Mendel's paper, there was considerable controversy whether Mendelian factors could account for the patterns of variation and correlation among relatives noted by Galton and others. The issue was resolved theoretically by R. A. Fisher's 1918 paper on the correlation between relatives on the supposition of Mendelian inheritance. Closing the gap between the study of evolution at the level of phenotypes and at the level of molecular genotypes remains one of the major challenges in population genetics.

PROBLEMS

1. Shell widths of mussels are approximately normally distributed. If the mean is 70 mm and the standard deviation is 10 mm, what fraction of the population is smaller than 80 mm?
2. Following Problem 1, what fraction of the population is between 80 and 90 mm in width?
3. Calculate the mean, variance, standard deviation, and standard error of the mean for the following bristle counts: 13, 14, 13, 15, 14, 15, 12, 13, 14, 16, 12, 15, 13, 14.
4. Measurements of body weight of a very large sample from a species of mouse have a mean of 60 g and variance of 64 g$^2$. In one area it was suspected that environmental contamination had reduced the size of the mice. A sample of 100 mice from this area had a mean of 58 g and a sample variance of 64 g$^2$. Is this sample population significantly smaller in size than the population examined with the very large sample?
5. A standard means for using a computer to generate normally distributed random numbers is to take 12 uniform random numbers and add them up.
After scaling the sum by a constant that depends on the mean and the variance, the result represents a sample from the normal distribution one wants. Why does this approach work?
6. One statement of the central limit theorem is that the sum of independent, identically distributed random variables has a limiting normal distribution. If the variables that are being added exhibit positive covariance in successive measures (as opposed to being independent), how would the sum deviate from the normal distribution predicted by the central limit theorem?
7. Allozyme gels reveal a sample with 64 FF, 32 FS, and 4 SS females, but there seem to be 40 FF males and 10 SS males with no heterozygotes. How do you explain these data?
8. Many proteins exist in an active form only as dimers, with two molecules joined either by hydrogen bonding or even by covalent cysteine bridges. If an enzyme is only active as a dimer, and there is electrophoretic variation in a population with two alleles (F and S), what do you think a heterozygote would look like on a gel? What would a heterozygote look like if only tetramers were active?
9. How many copies of a fragment of DNA should be present after 30 rounds of PCR, assuming perfect efficiency?
10. Taq polymerase does not have perfect fidelity in copying DNA sequences, and the result is that PCR products have some variation in sequence. Why do you suppose it is still possible to sequence PCR-amplified DNA to obtain the true sequence? When might the errors caused by Taq polymerase cause errors in the final sequence?
11. Many new ways of scoring DNA variation at individual nucleotide sites are becoming available, including an oligonucleotide ligation assay, "TaqMan," template-directed dye-terminator incorporation (TDI), and hybridization to dense oligonucleotide arrays known as DNA chips. An important criterion for the utility of any of these methods is that it must be very accurate.
Why is accuracy so critical?
12. Four sequences of a 1200 bp gene gave the following counts of pairwise differences: 4, 7, 5, 3, 6, 5. What is the estimate of nucleotide diversity for this sample?
13. In forensic applications of genetics, if the DNA types from a crime scene and a suspect do not match, the confidence one has in the conclusion is much greater than if the types do match. Why?

CHAPTER 3
Organization of Genetic Variation
Random Mating • Hardy-Weinberg Principle • Chi-square Test • Multiple Alleles • Linkage Disequilibrium

The word population has so far been used in an informal, intuitive sense to refer to a group of organisms belonging to the same species. Further discussion and clarification of the concept is necessary at this time. In population genetics, the word population does not usually refer to an entire species; it refers instead to a group of organisms of the same species living within a sufficiently restricted geographical area that any member can potentially mate with any other member (provided that they are of the opposite sex). Precise definition of such a unit is difficult and varies from species to species because of the almost universal presence of some sort of geographical structure in species, some typically nonrandom pattern in the spatial distribution of organisms. Members of a species are rarely distributed homogeneously in space: there is almost always some sort of clumping or aggregation, some schooling, flocking, herding, or colony formation. Population subdivision is often caused by environmental patchiness, areas of favorable habitat intermixed with unfavorable areas. Such environmental patchiness is obvious in the case of, for example, terrestrial organisms on islands in an archipelago, but patchiness is a common feature of most habitats: freshwater lakes have shallow and deep areas, meadows have marshy and dry areas, forests have sunny and shady areas. Population subdivision can also be caused by social behavior, as when wolves form packs. Even the human population is clumped or aggregated, into towns and cities and away from deserts and mountains.

The local interbreeding units of possibly large, geographically structured populations are of some interest because it is within such local units that adaptive evolution takes place through systematic changes in allele frequency. Such local interbreeding units, often called local populations or demes, are the fundamental units of population genetics. Local populations are the actual, evolving units of a species. Unless otherwise specified (or clear from context), the term population as used in this book means local population. Local populations are sometimes also referred to as Mendelian populations or subpopulations.

RANDOM MATING

In sexual organisms, genotypes are not transmitted from one generation to the next. Genotypes are broken up in gamete formation by the processes of segregation and recombination, and they are assembled anew in each generation in fertilization: genotypes -> gametes -> genotypes. The frequency of a specified genotype in a population is the genotype frequency. The formation of a genotype in newly fertilized eggs is determined by the opportunity for the relevant gametes to come together in fertilization, and the opportunity for gametes to come together in fertilization is determined by the matings that take place among organisms of reproductive age in the previous generation. To put the matter in a slightly different way, the genotypes of the mating pairs determine the genotypes of the progeny. Furthermore, there are mathematical relationships between the frequencies of mating pairs and the frequencies of progeny genotypes. Such mathematical relationships are usually inferred from models in which the types of matings in the population are specified.
One of the important models in population genetics is that of random mating, in which mating pairs have the same frequencies as if they were formed by random collisions between genotypes. The chance that an organism mates with another having a prescribed genotype is therefore equal to the frequency of the prescribed genotype in the population. For example, suppose that in some population the genotype frequencies of AA, Aa, and aa are 0.16, 0.48, and 0.36, respectively; if mating is random, AA males mate with AA, Aa, and aa females in the proportions 0.16, 0.48, and 0.36, respectively; these same proportions apply to the mates of Aa and aa males.

Superficial appearances to the contrary, random mating is not a simple or trivial process. One complication is that random mating depends on the trait: mating can be random with respect to some traits but nonrandom with respect to other traits at the same time and in the same population. For example, it is perfectly consistent for a human population to undergo random mating with respect to blood groups, allozyme phenotypes, restriction fragment length polymorphisms, and many other characteristics, but at the same time to engage in nonrandom mating with respect to other traits such as skin color and height. A second complication is population substructure. Paradoxical as it may seem, random mating may be observed within each of the subpopulations constituting a larger population, but random mating may still fail to hold in the population as a whole. (The reason for this paradox is discussed in Chapter 4.) In spite of these and other complications, random mating plays an important role in models in population genetics because random mating often serves as a point of departure for considering more realistic situations.
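The numerical example of random mating given in this section can be written out as a small calculation. The following is a minimal sketch, assuming only that the two members of a mating pair are drawn independently from the population; the function and variable names are illustrative.

```python
from itertools import product

# Genotype frequencies from the example in the text.
genotype_freq = {"AA": 0.16, "Aa": 0.48, "aa": 0.36}

def mating_pair_frequencies(freq):
    """Frequency of each ordered (male, female) pair under random mating.

    Under random mating the two genotypes are chosen independently,
    so a pair's frequency is the product of the two genotype frequencies.
    """
    return {(m, f): freq[m] * freq[f] for m, f in product(freq, repeat=2)}

pairs = mating_pair_frequencies(genotype_freq)

# Among AA males, the mates occur in the same proportions as in the population.
total_AA_males = sum(v for (m, _), v in pairs.items() if m == "AA")
for female in genotype_freq:
    print(female, round(pairs[("AA", female)] / total_AA_males, 2))
```

Replacing "AA" with "Aa" or "aa" in the last few lines gives the same three proportions, which is the sense in which mate choice is independent of genotype.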
Nonoverlapping Generations

One of the most important mathematical models in population genetics is the nonoverlapping generation model, in which the cycle of birth, maturation, and death includes the death of all organisms present in each generation before the members of the next generation mature. The nonoverlapping generation model is diagrammed in Figure 3.1. The model applies literally only to organisms with a very simple sort of life history, such as certain short-lived insects or annual plants that have a short growing season. In such plants, all members of any generation germinate at about the same time, mature together, shed their pollen, are fertilized almost simultaneously, and die immediately after producing the new generation. This sort of hypothetical population, with its simple life history, is used in population genetics as a first approximation to populations that have more complex life histories. Although at first glance the model seems hopelessly oversimplified, calculations of expected genotype frequencies based on the model are adequate for many purposes. In some applications, the nonoverlapping generation model turns out to be a useful approximation even for populations with a long and complex life history such as human beings.

Figure 3.1 The nonoverlapping generation model, shown as generations t - 1, t, and t + 1, each passing through birth, maturation, reproduction, and death. The life history of the organism is assumed to be like that of an annual plant (or any short-lived organism), and generations are assumed to be separated in time (discrete generations). Although the model is simple, it provides a convenient first approximation to populations with more complex life histories.

The Hardy-Weinberg Principle

Genotype frequencies are determined in part by the pattern of mating.
In this section, we consider the consequences of random mating in the model with nonoverlapping generations. To deduce the genotype frequencies under random mating, additional assumptions are needed. First, the allele frequencies should not change from one generation to the next because of systematic evolutionary forces, the most important of which are mutation, migration, and natural selection. For the moment, these evolutionary forces are assumed to be absent or negligibly small in magnitude. (Their effects are discussed in Chapters 5 and 6.) Second, the population must be large enough in size that the allele frequencies are not subject to change merely because of sampling error. Variation in allele frequency owing to sampling error in small populations is called random genetic drift and is the subject of Chapter 7. Although random genetic drift is present unless the population is infinite in size, the magnitude of the effect on allele frequency over a small number of generations is usually sufficiently small that the process can be ignored if population size is 500 or more. The qualifier "over a small number of generations" is important because the effects of random genetic drift are cumulative. Considered over a sufficiently large number of generations, random genetic drift can be important even in populations of size $10^6$ or more.

Before proceeding further, it may be helpful to summarize the assumptions that we are making:

• The organism is diploid.
• Reproduction is sexual.
• Generations are nonoverlapping.
• The gene under consideration has two alleles.
• The allele frequencies are identical in males and females.
• Mating is random.
• Population size is very large (in theory, infinite).
• Migration is negligible.
• Mutation can be ignored.
• Natural selection does not affect the alleles under consideration.

Collectively, these assumptions summarize the Hardy-Weinberg model, named after the English mathematician G. H.
Hardy (1877-1947) and the German physiologist Wilhelm Weinberg (1862-1937), who, in 1908, independently formulated the model and deduced its theoretical predictions of genotype frequency.

In the Hardy-Weinberg model, the mathematical relation between the allele frequencies and the genotype frequencies is given by

AA: $p^2$     Aa: $2pq$     aa: $q^2$     (3.1)

in which $p^2$, $2pq$, and $q^2$ are the frequencies of the genotypes AA, Aa, and aa in zygotes of any generation, $p$ and $q$ are the allele frequencies of A and a in gametes of the previous generation, and $p + q = 1$. The frequencies displayed in Equation 3.1 constitute the Hardy-Weinberg principle or the Hardy-Weinberg equilibrium (HWE).

One rationale for the Hardy-Weinberg principle displayed in Equation 3.1 is based on the outcome of repeated and independent trials. With random mating the choices of male gamete and female gamete are independent trials, and so pairs of gametes carrying the alleles AA, Aa, or aa are expected in proportions given by $(p\,A + q\,a)^2 = p^2\,AA + 2pq\,Aa + q^2\,aa$. A graphical illustration of the rationale of independent trials is shown in Figure 3.2. The chance of two A-bearing gametes coming together is $p \times p = p^2$ and that of two a-bearing gametes coming together is $q \times q = q^2$; for the heterozygote, the chance is $p \times q + q \times p = 2pq$ because the female gamete could carry A and the male gamete carry a, or the other way around.

Figure 3.2 Cross-multiplication square showing Hardy-Weinberg frequencies resulting from random mating with two alleles. Male gametes (A with frequency $p$, a with frequency $q$) are combined with female gametes (A with frequency $p$, a with frequency $q$) to give the zygotes AA, Aa, aA, and aa. Summed frequencies in zygotes: AA: $P' = p^2$; Aa: $Q' = pq + qp = 2pq$; aa: $R' = q^2$.

TABLE 3.1 DEMONSTRATION OF THE HARDY-WEINBERG PRINCIPLE

                                        Frequency of zygotes (progeny)
Mating      Frequency of mating         AA        Aa        aa
            (parents)
AA x AA     $P^2$                       1         0         0
AA x Aa     $2PQ$                       1/2       1/2       0
AA x aa     $2PR$                       0         1         0
Aa x Aa     $Q^2$                       1/4       1/2       1/4
Aa x aa     $2QR$                       0         1/2       1/2
aa x aa     $R^2$                       0         0         1
Totals (next generation)                $P'$      $Q'$      $R'$

where
$P' = P^2 + 2PQ/2 + Q^2/4 = (P + Q/2)^2 = p^2$
$Q' = 2PQ/2 + 2PR + Q^2/2 + 2QR/2 = 2(P + Q/2)(R + Q/2) = 2pq$
$R' = Q^2/4 + 2QR/2 + R^2 = (R + Q/2)^2 = q^2$

Random Mating of Genotypes versus Random Union of Gametes

Figure 3.2 implicitly assumes an important premise: that random mating of genotypes is equivalent to random union of gametes. A demonstration of this premise in the case of two alleles is outlined in Table 3.1, in which pairs of genotypes are chosen at random to form matings. The genotype frequencies of AA, Aa, and aa in the parental generation are written as $P$, $Q$, and $R$, respectively, where $P + Q + R = 1$. In terms of the genotype frequencies, the allele frequencies $p$ of A and $q$ of a are as follows:

$p = (2 \times P + Q)/2 = P + Q/2$
$q = (2 \times R + Q)/2 = R + Q/2$     (3.2)

Note that $p + q = P + Q + R = 1.0$; this result is a consequence of the fact that the gene has only two alleles.

With two alleles of a gene, there are six possible types of matings. When mating is random, these mating types take place in proportion to the genotypic frequencies in the population, and the types of mating pairs are given by successive terms in the expansion of $(P\,AA + Q\,Aa + R\,aa)^2$. For example, the proportion of AA x AA matings is $P \times P = P^2$. Similarly, the proportion of AA x Aa matings is $2 \times P \times Q$ because the mating can include either an AA male with an Aa female (proportion $P \times Q$) or an Aa male with an AA female (proportion $Q \times P$). The frequencies of these and the other types of matings are given in the second column of Table 3.1.

The genotypes of the zygotes produced by the matings are given in the last three columns of Table 3.1.
The offspring frequencies follow from Mendel's law of segregation, which states that an Aa heterozygote produces an equal number of A-bearing and a-bearing gametes. The AA and aa homozygotes produce only A-bearing and only a-bearing gametes, respectively. Thus, the mating AA x aa produces all Aa zygotes, the mating AA x Aa produces 1/2 AA and 1/2 Aa zygotes, the mating Aa x Aa produces 1/4 AA, 1/2 Aa, and 1/4 aa zygotes, and so forth.

The genotype frequencies of AA, Aa, and aa zygotes after one generation of random mating are denoted in Table 3.1 as $P'$, $Q'$, and $R'$, respectively. These values are calculated as the sum of the cross-products shown at the bottom of the table. The genotype frequencies simplify to $P' = p^2$, $Q' = 2pq$, and $R' = q^2$, where $p$ and $q$ are the allele frequencies given in Equation 3.2. Note that the parental genotype frequencies $P$, $Q$, and $R$ were completely arbitrary except for the requirement that $P + Q + R = 1$. Therefore, the Hardy-Weinberg frequencies are attained after one generation of random mating irrespective of the genotype frequencies in the parental generation.

PROBLEM 3.1 [illegible in the source scan]

PROBLEM 3.2 [...] genotype frequencies for two alleles, $E6^F$ and $E6^S$, of the gene coding for esterase-6 were found to be consistent with Hardy-Weinberg proportions with allele frequencies of 0.3579 for $E6^F$ and 0.6421 for $E6^S$ (Mukai et al. 1974).
Assuming that all of the assumptions of the Hardy-Weinberg model hold, particularly those pertaining to random mating in a large population with no mutation, selection, or migration, make a table of mating frequencies similar to Table 3.1 for the esterase-6 alleles. Then calculate the genotype frequencies expected in the next generation along with the corresponding allele frequencies.

ANSWER The Hardy-Weinberg frequencies among parents are FF: 0.1281; FS: 0.4596; and SS: 0.4123. Therefore, the expected frequencies of the matings are: FF x FF (0.0164); FF x FS (0.1177); FF x SS (0.1056); FS x FS (0.2112); FS x SS (0.3790); and SS x SS (0.1700). The expected genotype frequencies among the zygotes are, for FF, 0.0164 + 0.1177/2 + 0.2112/4 = 0.1281; for FS, 0.1177/2 + 0.1056 + 0.2112/2 + 0.3790/2 = 0.4596; for SS, 0.2112/4 + 0.3790/2 + 0.1700 = 0.4123; note that these are the same as in the parental generation. The allele frequencies of F and S are again 0.3579 and 0.6421, respectively.

PROBLEM 3.3 Use a cross-multiplication square like that in Figure 3.2 to show that, when the allele frequencies differ in male and female parents, the Hardy-Weinberg frequencies are not attained after one generation of random mating. Use the symbols $p_m$ and $q_m$ for the frequencies of A and a in male gametes and the symbols $p_f$ and $q_f$ for the frequencies of A and a in female gametes. After the first generation of random mating, what are the genotype frequencies in male and female zygotes? What are the allele frequencies in male and female zygotes? What are the genotype frequencies in zygotes after the second generation of random mating? Are these in Hardy-Weinberg proportions?

ANSWER This problem demonstrates the principle that, with random mating, the frequency of an allele in zygotes equals the average of the allele frequencies in the parents. If the allele frequencies in parents differ, then random mating results in Hardy-Weinberg proportions only after two generations.
The first generation equalizes the allele frequencies in males and females, and the second generation yields the Hardy-Weinberg proportions. Using the suggested symbols, after one generation of random mating, the genotype frequencies are AA: $p_m p_f$, Aa: $p_m q_f + q_m p_f$, and aa: $q_m q_f$. These are not in the form $x^2$, $2x(1 - x)$, and $(1 - x)^2$ unless $p_m = p_f$ and $q_m = q_f$. However, the allele frequencies have become equal in the sexes at $p = p_m p_f + (p_m q_f + q_m p_f)/2 = (p_m + p_f)/2$ and $q = (q_m + q_f)/2$. The HWE is reached in one additional generation of random mating, in which the genotype frequencies in zygotes are $p^2$, $2pq$, and $q^2$.

Implications of the Hardy-Weinberg Principle

The Hardy-Weinberg principle has provided the foundation for many theoretical and experimental investigations in population genetics. However, the theory is far from profound, and the applicability is far from universal. Hardy especially seems to have regarded the Hardy-Weinberg principle as virtually self-evident. He writes, "I should have expected the very simple point which I wish to make to have been familiar to biologists." In fact, it was familiar to some biologists: the basic principle had been noted as early as 1903 by the Harvard geneticist William E. Castle (1867-1962). Castle's work was little known, however, and Hardy was writing to counter an argument put forth against Mendelism that phenotypic ratios of 3 dominant to 1 recessive should be encountered frequently in natural populations if the mechanism of Mendelian heredity were generally applicable. The immediate implication of the Hardy-Weinberg principle was to refute the 3 : 1 argument by showing that the genotypic ratio of A- : aa is determined by the allele frequencies and has no more tendency to attain one particular ratio than any other.
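Hardy's rebuttal of the 3 : 1 argument can be made concrete with a short calculation. The sketch below is illustrative only: it assumes Hardy-Weinberg proportions and complete dominance of A over a, and the function name is hypothetical. It prints the ratio of dominant (A-) to recessive (aa) phenotypes for several allele frequencies.

```python
def dominant_to_recessive_ratio(q):
    """Ratio of A- to aa phenotypes under Hardy-Weinberg proportions.

    q is the frequency of the recessive allele a. The aa frequency is q**2,
    and every other genotype (AA or Aa) shows the dominant phenotype.
    """
    recessive = q ** 2
    return (1 - recessive) / recessive

for q in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"q = {q}: {dominant_to_recessive_ratio(q):.2f} : 1")
```

Only q = 0.5 gives a 3 : 1 ratio; any other allele frequency gives a different ratio, which is why a 3 : 1 ratio is not expected in general in natural populations.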
Beyond the virtue of simplicity, why would anyone want to consider a model based on so many restrictive and seemingly incorrect assumptions? And in what sense can such a simple model be considered fundamental? Among several reasons, two stand out. First, the Hardy-Weinberg model is a reference model in which there are no evolutionary forces at work other than those imposed by the process of reproduction itself. In this sense, the model is similar to models in mechanical physics where objects fall through the sky without wind resistance or roll down inclined planes without friction. The model affords a baseline for comparison with more realistic models in which evolutionary forces can change allele frequencies. Perhaps more importantly, the Hardy-Weinberg model separates life history into two intervals: gametes -> zygotes and zygotes -> adults. In constructing more complex and realistic models, one can often introduce the complications into the zygotes -> adults part of the life cycle, for example, in considering the effects of migration into the population or of differential survival among the genotypes. With all sources of change in allele frequency accounted for in the zygotes -> adults component, the gametes -> zygotes component follows from the principle that random union of gametes results in the Hardy-Weinberg proportions among zygotes. In other words, the Hardy-Weinberg model is fundamental in the sense that the approach of tracking allele and genotype frequencies through time can be generalized to more realistic situations.

One of the most important implications of the Hardy-Weinberg principle emerges when we calculate the allele frequencies of A and a in the next generation from the formulas for $P'$, $Q'$, and $R'$ in Table 3.1. Using the result in Equation 3.2, the allele frequency of A among the zygotes equals $P' + Q'/2 = p^2 + 2pq/2 = p(p + q) = p$.
Likewise, the allele frequency of a among zygotes equals $R' + Q'/2 = q^2 + 2pq/2 = q(q + p) = q$. Thus, the allele frequencies in the next generation are exactly the same as they were the generation before. With random mating, the allele frequencies remain the same generation after generation. In any generation, therefore, the genotype frequencies are $p^2$, $2pq$, and $q^2$ for AA, Aa, and aa, respectively, as given in Equation 3.1. The constancy of allele frequency, and therefore of the genotypic composition of the population, is the single most important implication of the Hardy-Weinberg principle. The constancy of allele frequencies implies that, in the absence of specific evolutionary forces to change allele frequency, the mechanism of Mendelian inheritance, by itself, keeps the allele frequencies constant and thus preserves genetic variation. A second item of interest is that the Hardy-Weinberg frequencies are attained in just one generation of random mating if the allele frequencies are the same in males and females. This, however, is true only with nonoverlapping generations; in populations with more complex life histories, the Hardy-Weinberg frequencies are attained gradually over a period of several generations.

It is important to note here that conventional statistical tests for Hardy-Weinberg proportions (such as the $\chi^2$ test discussed below) are not very sensitive to deviations from the expected genotype frequencies. Consequently, the mere fact that observed genotype frequencies may happen to fit the Hardy-Weinberg proportions cannot be taken as evidence that all of the assumptions underlying the model are valid. The most that can be concluded is that, whatever departures from the assumptions there may be, they are not sufficiently large to result in deviations from HWE that are detectable with conventional statistical tests.
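Two claims in this section, that the allele frequencies stay constant and that the Hardy-Weinberg frequencies are reached after a single generation of random mating, can be checked numerically by summing over the six mating types of Table 3.1. This is a minimal sketch under the Hardy-Weinberg assumptions listed earlier; the function names and the starting frequencies are arbitrary choices made for illustration.

```python
def next_generation(P, Q, R):
    """One generation of random mating, summed over the six mating types of Table 3.1."""
    # Each entry: (mating frequency, offspring proportions of AA, Aa, aa)
    matings = [
        (P * P,     (1.0, 0.0, 0.0)),    # AA x AA
        (2 * P * Q, (0.5, 0.5, 0.0)),    # AA x Aa
        (2 * P * R, (0.0, 1.0, 0.0)),    # AA x aa
        (Q * Q,     (0.25, 0.5, 0.25)),  # Aa x Aa
        (2 * Q * R, (0.0, 0.5, 0.5)),    # Aa x aa
        (R * R,     (0.0, 0.0, 1.0)),    # aa x aa
    ]
    return tuple(sum(freq * offspring[i] for freq, offspring in matings) for i in range(3))

def allele_freq_A(P, Q, R):
    """Equation 3.2: p = P + Q/2."""
    return P + Q / 2

# Arbitrary starting genotype frequencies, far from Hardy-Weinberg proportions.
genotypes = (0.5, 0.1, 0.4)
for generation in range(3):
    P, Q, R = genotypes
    print(generation, [round(x, 4) for x in genotypes], round(allele_freq_A(P, Q, R), 4))
    genotypes = next_generation(P, Q, R)
```

The printed allele frequency is the same in every row, while the genotype frequencies change once (from generation 0 to generation 1) and then stay fixed.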
The Hardy-Weinberg Principle in Operation

Application of the Hardy-Weinberg principle can be illustrated with data on the MN blood groups in a British population. In a sample of 1000 people (Race and Sanger 1975), the observed phenotypes were 298 blood group M (indicating genotype MM), 489 blood group MN (indicating genotype MN), and 213 blood group N (indicating genotype NN). To determine whether these genotype frequencies are in accord with HWE, the allele frequencies of M and N must first be estimated. The estimated allele frequency $\hat{p}$ of M is 1085/2000 = 0.5425 and that $\hat{q}$ of N is 915/2000 = 0.4575. (For the details see Problem 1.4 in Chapter 1.) Were the population in HWE, we would expect the genotype frequencies of MM, MN, and NN to be $p^2$, $2pq$, and $q^2$, respectively, where $p$ and $q$ are the allele frequencies in the underlying population from which the sample was drawn. Because $p$ and $q$ are parameters, their true values are unknown. However, in testing for HWE we can substitute the estimated values to obtain the expected proportions MM: $(0.5425)^2 = 0.2943$, MN: $2(0.5425)(0.4575) = 0.4964$, and NN: $(0.4575)^2 = 0.2093$, respectively. Because the sample size is 1000, the expected numbers of the MM, MN, and NN genotypes are 0.2943 x 1000 = 294.3, 0.4964 x 1000 = 496.4, and 0.2093 x 1000 = 209.3, respectively. At this point, it is convenient to tabulate the data into three columns, the first giving the genotypes, the second giving the observed numbers, and the third giving the expected numbers:

Genotype    Observed    Expected
MM          298         294.3
MN          489         496.4
NN          213         209.3

With the data so arrayed, it is evident that the fit between the observed numbers and the expected numbers, though not perfect because of chance statistical fluctuations in the number of each genotype that may be included in any given sample, is nevertheless very close.
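The arithmetic behind the table can be reproduced in a few lines. The sketch below is offered as an illustration rather than a prescribed procedure: it estimates the allele frequencies by counting alleles in the sample and then computes the expected genotype numbers under HWE. The function name is hypothetical.

```python
def hardy_weinberg_expected(n_MM, n_MN, n_NN):
    """Estimate allele frequencies by counting alleles; return expected genotype numbers."""
    n = n_MM + n_MN + n_NN
    p_hat = (2 * n_MM + n_MN) / (2 * n)   # each MM carries two M alleles, each MN one
    q_hat = 1 - p_hat
    expected = (p_hat ** 2 * n, 2 * p_hat * q_hat * n, q_hat ** 2 * n)
    return p_hat, q_hat, expected

p_hat, q_hat, expected = hardy_weinberg_expected(298, 489, 213)
print(round(p_hat, 4), round(q_hat, 4))     # 0.5425 0.4575
print([round(e, 1) for e in expected])      # [294.3, 496.4, 209.3]
```

The same function applies to any two-allele system in which all three genotypes can be distinguished.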
To verify this conclusion, we will apply a conventional statistical test to the data in order to assess quantitatively the closeness of fit. A test commonly employed in population genetics is called the chi-square test, which is based on the value of a number, called $\chi^2$, calculated from the data as

$\chi^2 = \sum \frac{(\mathrm{obs} - \mathrm{exp})^2}{\mathrm{exp}}$     (3.3)

where obs refers to the observed number in any genotypic class, exp refers to the expected number in the same genotypic class, and the $\sum$ sign denotes that the values are to be summed over all genotypic classes. In the case at hand,

$\chi^2 = (298 - 294.3)^2/294.3 + (489 - 496.4)^2/496.4 + (213 - 209.3)^2/209.3 = 0.222$

To be completely unambiguous, some statisticians prefer use of the symbol $X^2$ for the realized value of the test statistic defined in Equation 3.3, in order to distinguish between the test statistic and the true $\chi^2$ distribution itself. The distinction should certainly be kept in mind, but we will not recognize it formally with different symbols.

Associated with any $\chi^2$ value is a second number called the degrees of freedom for that $\chi^2$. In general, the number of degrees of freedom (df) associated with a $\chi^2$ value equals

df = Number of classes of data - Number of parameters estimated from the data - 1

In the MN example, there are three classes of data and one parameter ($p$) estimated from the data, and so df = 3 - 1 - 1 = 1. Note that a degree of freedom is not subtracted for estimating $q$ because of the relation $q = 1 - p$; that is, once $p$ has been estimated, the estimate of $q$ is automatically fixed, and so we deduct just the one degree of freedom corresponding to $p$.

Calculation of $\chi^2$ and its associated degrees of freedom is carried out in order to obtain a number for assessing goodness of fit; the number is determined from Figure 3.3.
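Equation 3.3 and the degrees-of-freedom rule can also be evaluated directly instead of being read from a chart. The sketch below uses the observed and expected numbers from the MN example. The probability is computed with a closed form that holds only for one degree of freedom, which is an assumption of this sketch and not a general method; the function name is illustrative.

```python
from math import erfc, sqrt

def chi_square(observed, expected):
    """Equation 3.3: sum of (obs - exp)^2 / exp over all genotypic classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [298, 489, 213]          # MM, MN, NN
expected = [294.3, 496.4, 209.3]    # Hardy-Weinberg expectations from the text

chi2 = chi_square(observed, expected)
df = len(observed) - 1 - 1          # classes - parameters estimated (p) - 1

# For one degree of freedom only, the upper-tail chi-square probability
# has the closed form erfc(sqrt(chi2 / 2)).
p_value = erfc(sqrt(chi2 / 2))

print(round(chi2, 3), df, round(p_value, 2))
```

The exact probability comes out near 0.64, somewhat below the value of about 0.67 obtained by reading Figure 3.3 by eye. Either way it is far above 0.05, so the conclusion drawn from the chart is unchanged.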
To use the chart, find the value of $\chi^2$ along the horizontal axis, then move vertically from this value until the line for the number of degrees of freedom is intersected, then move horizontally from the point of intersection to the vertical axis and read the corresponding probability value P. In our case, with $\chi^2$ = 0.222 and one degree of freedom, the corresponding probability value is about P = 0.67. The probability associated with a particular $\chi^2$ test has the following interpretation: it is the probability that chance alone could produce a deviation between the observed and expected values at least as great as the deviation actually realized. Thus, if the probability is large, it means that chance alone could account for the deviation, and it strengthens our confidence in the validity of the model used to obtain the expectations (in this case, the Hardy-Weinberg model). On the other hand, if the probability associated with the $\chi^2$ is small, it means that chance alone is not likely to lead to a deviation as large as actually realized, and it undermines our confidence in the validity of the model. Where exactly the cutoff should be between a "large" probability and a "small" one is, of course, not obvious, but there is an established guideline to follow. If the probability is less than 0.05, then the goodness of fit is considered sufficiently poor that the model is judged invalid for the data; alternatively, if the probability is greater than 0.05, the fit is considered sufficiently close that the model is not rejected. Because the probability in the MN example is 0.67, which is greater than 0.05, we have no reason to reject the hypothesis that the genotype frequencies are in Hardy-Weinberg proportions for this gene.

Figure 3.3 Graph of $\chi^2$. (Horizontal axis: calculated $\chi^2$ value; vertical axis: probability.)
To use the graph, find the value of $\chi^2$ along the horizontal axis, then read the probability value for the appropriate number of degrees of freedom from the vertical axis. (From Hartl 1994.)

PROBLEM 3.4 In the Ss blood group, related to the MN system, three phenotypes corresponding to the genotypes SS, Ss, and ss can be identified by appropriate reagents. Among the same 1000 British people who gave the MN data above, the observed numbers of each genotype for the Ss blood groups were 99 SS, 418 Ss, and 483 ss. Estimate the allele frequency of S (p) and s (q) and carry out a $\chi^2$ test of goodness of fit between the observed genotype frequencies and their Hardy-Weinberg expectations. Is there any reason to reject the hypothesis of Hardy-Weinberg proportions for this gene?

ANSWER p = 0.308 and q = 0.692. The expected numbers of SS, Ss, and ss are 94.86, 426.27, and 478.86, respectively. The $\chi^2$ = 0.377 with one degree of freedom. The associated probability from Figure 3.3 is about 0.55, so there is no reason to reject the hypothesis of HWE.

Complications of Dominance

Dominance obscures the one-to-one relation between phenotype and genotype, but the allele frequencies can still be estimated if one is willing to assume HWE. For a polymorphic gene with two alleles in which one of the alleles is dominant, only two phenotypic classes can be distinguished: the dominant phenotype and the recessive phenotype. An example is the D allele in the human Rh blood groups, which codes for an Rh+ antigen present on the surface of red blood cells. An alternative allele, designated d, fails to code for the antigen. The allele D is dominant over d because both DD and Dd genotypes produce the Rh+ antigen. The genotypes DD and Dd therefore have the Rh+ phenotype and are said to be Rh positive; the dd genotype has the phenotype Rh- and is said to be Rh negative.
At the molecular level, the Dd genotype might be expected to produce only half as much antigen as DD because it contains only one D allele, but the phenotype is nevertheless Rh positive. Among American Caucasians, the frequency of Rh+ is about 85.8% and the frequency of Rh- is about 14.2% (Mourant et al. 1976). Given only the phenotype frequencies, the data cannot be used to calculate the genotype frequencies because we have no way of knowing what proportion of Rh+ phenotypes are DD and what proportion are Dd. However, if we are willing to assume random mating, then the relative proportions of DD and Dd genotypes are given by the Hardy-Weinberg principle. Assuming random mating and HWE, the genotype frequencies are given by $p^2$, $2pq$, and $q^2$, where p is the allele frequency of D. An estimate of q can therefore be obtained by setting $\hat{q}^2 = 0.142$ (the frequency of the homozygous recessive phenotype), and so $\hat{q} = \sqrt{0.142} = 0.3768$. More generally, if R is the frequency of homozygous recessive genotypes found in a sample of n organisms, then q and its standard error are estimated as

$$\hat{q} = \sqrt{R} \qquad \mathrm{SE}(\hat{q}) = \sqrt{\frac{1 - R}{4n}} \qquad (3.4)$$

With q estimated from Equation 3.4 as 0.3768, then p = 1 - 0.3768 = 0.6232, and the frequencies of DD, Dd, and dd are expected to be $p^2 = (0.6232)^2 = 0.3884$, $2pq = 2(0.6232)(0.3768) = 0.4696$, and $q^2 = (0.3768)^2 = 0.1420$, respectively. The proportion of Rh+ people that are actually heterozygous is therefore 0.4696/(0.4696 + 0.3884) = 54.7%. However, when there is dominance, there is no possibility for a $\chi^2$ test of goodness of fit to HWE because there are 0 degrees of freedom (two classes of data, minus one estimated parameter, minus 1). The lack of degrees of freedom is the reason why the calculated frequencies of Rh+ and Rh- (0.3884 + 0.4696 = 0.858 and 0.142, respectively) fit the observed frequencies exactly.
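Equation 3.4 and the heterozygote calculation above can be sketched as follows. The function names are assumptions made for illustration. The text gives the Rh frequencies without a sample size, so the value n = 1000 used for the standard error is purely hypothetical.

```python
import math


def recessive_allele_estimate(R, n):
    """Equation 3.4: q = sqrt(R), SE(q) = sqrt((1 - R) / (4n)).

    R is the observed frequency of homozygous recessives among n organisms.
    """
    q = math.sqrt(R)
    se = math.sqrt((1 - R) / (4 * n))
    return q, se


def heterozygous_fraction_of_dominants(q):
    """Proportion of dominant-phenotype individuals that are heterozygous,
    assuming random mating: 2pq / (p^2 + 2pq)."""
    p = 1 - q
    return 2 * p * q / (p * p + 2 * p * q)


# R = 0.142 is the Rh- frequency from the text; n = 1000 is a hypothetical sample size
q_hat, se = recessive_allele_estimate(R=0.142, n=1000)
het_fraction = heterozygous_fraction_of_dominants(q_hat)
print(round(q_hat, 4), round(se, 4), round(het_fraction, 3))
```

The estimate of q depends only on R, whereas the standard error shrinks as the sample size grows, which is why a sample size has to be supplied for the second quantity.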
PROBLEM 3.5 The Basque people, who live in the Pyrenees mountains between France and Spain, have one of the highest frequencies of the d allele in the Rh system so far reported. In one study of 400 Basques, 230 were found to be Rh+ and 170 Rh- (Mourant et al. 1976). Estimate the frequencies of the D and d alleles, the genotype frequencies, and the proportion of Rh+ people who are heterozygous Dd. What is the standard error of the estimate $\hat{q}$?

ANSWER $\hat{q} = \sqrt{170/400} = 0.65$, p = 0.35, and the estimated genotype frequencies of DD, Dd, and dd are 0.121, 0.454, and 0.425, respectively. The proportion of Dd among Rh+ phenotypes in the Basque population is 0.454/(0.121 + 0.454) = 79%. The standard error of $\hat{q}$ equals $\sqrt{(1 - 0.425)/1600} = 0.02$.

The Hardy-Weinberg principle also finds application in studies of industrial melanism, one of the most famous and best-studied cases of evolution in action (Kettlewell 1973). Industrial melanism refers to the evolution of black (melanic) color patterns in several species of moths that accompanied progressive pollution of the environment by coal soot during the industrial revolution. (The various color forms of the moths are known as morphs.) The evolution of melanism has been observed in Great Britain, West Germany, Eastern Europe, the United States, and in other heavily industrialized areas. The species that evolve melanism are typically large moths that fly by night and rest in a sort of cataleptic state by day, often on the trunks of trees, using their cryptic black-and-white mottled color pattern for concealment from visually cued predators such as hedge sparrows, redstarts, and robins (Figure 3.4). Of nearly 800 species of large moths in the British Isles, where industrial melanism has been most intensively studied, about 100 species are industrial melanics (Bishop and Cook 1975). The best known of these are the peppered moth (Biston betularia) and the scalloped hazel moth (Gonodontis bidentata).
In most instances, the melanic color pattern has been found to be due to a single dominant allele.

PROBLEM 3.6 In one study of a heavily polluted area near Birmingham, England, Kettlewell (1956) observed a frequency of 87% melanic Biston betularia. Estimate the frequency of the dominant allele leading to melanism in this population and the frequency of melanics that are heterozygous.

ANSWER The observed frequency of homozygous recessives is R = 0.13, and so the frequency of the recessive allele is estimated as $\hat{q} = \sqrt{0.13} = 0.36$. Assuming random mating, the expected frequencies of dominant homozygotes, heterozygotes, and recessive homozygotes are 0.41, 0.46, and 0.13, respectively. The proportion of melanics that are heterozygous is 0.46/0.87 = 52.9%.

Figure 3.4 Melanic and nonmelanic moths, showing camouflage of light moths on light background and dark moths on dark. (Photograph by H. B. D. Kettlewell.)

Frequency of Heterozygotes

The Hardy-Weinberg principle also has important implications for the frequency of heterozygotes carrying rare recessive alleles. The graphs in Figure 3.5 depict the frequencies of AA, Aa, and aa in a population in HWE. The heterozygotes are most frequent when the allele frequencies are 0.5. Suppose that the allele a is a recessive, and consider the curves as the allele frequency of a goes toward 0. As a becomes rare, the frequencies of recessive homozygotes and heterozygotes both decrease, but the frequency of the recessive homozygote is much lower. As the frequency of a goes to 0, the frequency of recessive homozygotes goes to 0 at a rate of $q^2$, whereas the frequency of heterozygotes goes to 0 at a rate of $2pq$. The result is that the ratio of heterozygotes to recessive homozygotes increases without limit as the recessive allele becomes rare.

Figure 3.5 Frequencies of AA, Aa, and aa genotypes with HWE. (Horizontal axes: frequency of the A allele and frequency of the a allele.) Note that, as either allele becomes more rare, the frequency of homozygotes for that allele is much lower than the frequency of heterozygotes.

To illustrate the principle, suppose q = 0.10; then $2pq/q^2 = 18$, meaning that there are 18 times as many heterozygotes as recessive homozygotes. For q = 0.01, to take a more extreme example, the ratio is 198; and for q = 0.001, the ratio is 1998. These examples demonstrate that when a recessive allele is rare, most genotypes containing the rare allele are heterozygous. Quantitatively, the ratio of heterozygotes to homozygotes equals $2pq/q^2 = 2/q - 2$, which, for small q, is approximately 2/q. Consequently, the excess of heterozygotes over homozygotes becomes progressively greater as the recessive allele becomes more rare. To take a real example, consider cystic fibrosis, an autosomal-recessive defect in chloride transport characterized by abnormal glandular secretions, impaired digestion, frequent respiratory infections, and other serious symptoms. The frequency of the homozygous recessive genotype in newborn Caucasians is approximately 1 in 1700. For this allele, $q = \sqrt{1/1700} = 0.024$. Assuming random mating, the frequency of heterozygotes is estimated as 2(0.024)(1 - 0.024) = 0.047, or about 1 in 21. In other words, although only 1 person in 1700 is actually affected with cystic fibrosis, 1 person in 21 is a heterozygous carrier of the harmful allele.

PROBLEM 3.7 Phenylketonuria is a defect in phenylalanine metabolism caused by lack of a functioning allele. Over 200 defective alleles have been identified, and most affected individuals are actually heterozygous for two different defective alleles. The condition affects about 1 in 10,000 newborn Caucasians. Estimate the frequency of heterozygotes for the normal and a defective allele under the assumption of random mating.
ANSWER About 1 person in 50 carries a defective allele.

SPECIAL CASES OF RANDOM MATING

In this section we extend the Hardy-Weinberg principle to multiple alleles and to genes located on the X chromosome.

Three or More Alleles

Genotype frequencies under random mating for genes with three alleles are shown in Figure 3.6. Here it is convenient to label the alleles as $A_1$, $A_2$, and $A_3$ and the corresponding allele frequencies as $p_1$, $p_2$, and $p_3$. Because there are only three alleles, $p_1 + p_2 + p_3 = 1$. With three alleles there are six diploid genotypes, and under random mating their expected frequencies are as follows:

Genotype    Frequency
$A_1A_1$    $p_1^2$
$A_1A_2$    $2p_1p_2$
$A_2A_2$    $p_2^2$
$A_1A_3$    $2p_1p_3$
$A_2A_3$    $2p_2p_3$
$A_3A_3$    $p_3^2$

These frequencies can be obtained by expanding $(p_1A_1 + p_2A_2 + p_3A_3)^2$, which the cross-multiplication square in Figure 3.6 does automatically.

Figure 3.6 Cross-multiplication square showing Hardy-Weinberg frequencies for three autosomal alleles.

Application of Figure 3.6 can be illustrated with the familiar ABO blood groups in humans. The ABO blood groups are controlled by three alleles designated $I^A$, $I^B$, and $I^O$. Genotypes $I^AI^A$ and $I^AI^O$ have blood type A; genotypes $I^BI^B$ and $I^BI^O$ have blood type B; genotype $I^OI^O$ has blood type O; and genotype $I^AI^B$ has blood type AB. In one test of 6313 Caucasians in Iowa City, the number of people with blood types A, B, O, and AB was found to be 2625, 570, 2892, and 226, respectively (Mourant et al. 1976). The best estimates of allele frequency in this case are $p_1$ = 0.2593 (for $I^A$), $p_2$ = 0.0652 (for $I^B$), and $p_3$ = 0.6755 (for $I^O$).
(Estimation of allele frequencies for the ABO blood groups is complicated because of dominance; for methods see Cavalli-Sforza and Bodmer 1971 and Vogel and Motulsky 1986.) The expected (and observed) numbers of the four blood-type phenotypes are therefore:

Blood type   Expected number                                              Observed number
A            $(0.2593^2 + 2 \times 0.2593 \times 0.6755) \times 6313 = 2636.0$   2625
B            $(0.0652^2 + 2 \times 0.0652 \times 0.6755) \times 6313 = 582.9$    570
O            $0.6755^2 \times 6313 = 2880.6$                                     2892
AB           $(2 \times 0.2593 \times 0.0652) \times 6313 = 213.5$               226

The $\chi^2$ for goodness of fit to Hardy-Weinberg proportions is 1.11. There is one degree of freedom for this test: 4 (to start with) - 1 (for fixing the total at 6313) - 1 (for estimating $p_1$ from the data) - 1 (for estimating $p_2$ from the data); a degree of freedom is not deducted for estimating $p_3$ because $p_3 = 1 - p_1 - p_2$. For a $\chi^2$ of 1.11 with one degree of freedom, the associated probability from Figure 3.3 is about 0.30, and so the Iowa City population gives no evidence against Hardy-Weinberg proportions for this gene.

PROBLEM 3.8 In a sample of 1617 Spanish Basques, the numbers of A, B, O, and AB blood types observed were 724, 110, 763, and 20, respectively (Mourant et al. 1976). The best estimates of allele frequency are $p_1$ = 0.2661 (for $I^A$), $p_2$ = 0.0411 (for $I^B$), and $p_3$ = 0.6928 (for $I^O$). Calculate the expected numbers of the four phenotypes and carry out a $\chi^2$ test for goodness of fit to the Hardy-Weinberg expectations.

ANSWER The expected numbers of A, B, O, and AB are 710.7, 94.8, 776.1, and 35.4, respectively. The $\chi^2$ equals 9.61 with one degree of freedom, for which the corresponding probability is 0.0025. Because a deviation as large or larger than that observed would be expected by chance in only 0.0025 samples (that is, about 1 in 400), there is very good reason to reject the hypothesis that the genotypes are in Hardy-Weinberg proportions in this population. The reason for the discrepancy is not known.
One likely possibility is migration into the population by people with allele frequencies that are significantly different from those among the Basques themselves.

PROBLEM 3.9 Among many aboriginal American Indian tribes, the allele frequency of $I^B$ is extremely low. For example, a sample of 600 Papago Indians from Arizona included 37 A and 563 O blood types (Mourant et al. 1976). What are the best estimates of the allele frequencies of $I^A$, $I^B$, and $I^O$ in this population, and what are the expected genotype frequencies assuming random mating?

ANSWER There are no $I^B$ alleles in the sample, so the best estimate of $p_2$ is 0. Thus, there are only two alleles, $I^A$ and $I^O$, with $I^A$ dominant. The best estimate of $p_3$ is thus obtained from Equation 3.4 as $\sqrt{563/600} = 0.9687$ and that of $p_1$ as $1 - p_3 = 0.0313$. The expected genotype frequencies are $0.0313^2 = 0.0010$ for $I^AI^A$, $2(0.0313)(0.9687) = 0.0606$ for $I^AI^O$, and $0.9687^2 = 0.9384$ for $I^OI^O$.

In general, if there are n alleles $A_1, A_2, \ldots, A_n$ with respective frequencies $p_1, p_2, \ldots, p_n$ (and $p_1 + p_2 + \cdots + p_n = 1$), then the genotype frequencies expected under random mating are

$$p_i^2 \ \text{for } A_iA_i \text{ homozygotes} \qquad 2p_ip_j \ \text{for } A_iA_j \text{ heterozygotes} \qquad (3.5)$$

Equation 3.5 may be applied to data on allozyme polymorphisms in Drosophila persimilis in California. One sample of 108 adult flies from the Fish Creek population included four alleles of the gene Xdh, which codes for xanthine dehydrogenase. We may call the alleles Xdh-1, Xdh-2, Xdh-3, and Xdh-4; their respective frequencies were estimated as $p_1$ = 0.08, $p_2$ = 0.21, $p_3$ = 0.62, and $p_4$ = 0.09 (Prakash 1977). With four alleles, there are four possible homozygotes (for example, Xdh-1/Xdh-1) and six possible heterozygotes (for example, Xdh-1/Xdh-2). In a random-mating population, the frequency of any homozygous genotype is expected to be the square of the corresponding allele frequency.
For example, the frequency of Xdh-1/Xdh-1 is expected to be $p_1^2$. Likewise, the frequency of any heterozygous genotype is expected to be two times the product of the corresponding allele frequencies. For example, the frequency of Xdh-1/Xdh-2 is expected to be $2p_1p_2$. The Hardy-Weinberg frequencies for all 10 possible genotypes can be obtained by expanding the expression $(0.08\ \text{Xdh-1} + 0.21\ \text{Xdh-2} + 0.62\ \text{Xdh-3} + 0.09\ \text{Xdh-4})^2$.

PROBLEM 3.10 Four alleles of the gene Adh coding for alcohol dehydrogenase were found in a Texas population of Phlox cuspidata (Levin 1978). The alleles may be designated Adh-1, Adh-2, Adh-3, and Adh-4. Their frequencies were estimated as 0.11, 0.84, 0.01, and 0.04, respectively. What are the expected Hardy-Weinberg proportions of the 10 genotypes?

ANSWER
Adh-1/Adh-1: $0.11^2 = 0.0121$
Adh-1/Adh-2: $2(0.11)(0.84) = 0.1848$
Adh-2/Adh-2: $0.84^2 = 0.7056$
Adh-1/Adh-3: $2(0.11)(0.01) = 0.0022$
Adh-2/Adh-3: $2(0.84)(0.01) = 0.0168$
Adh-3/Adh-3: $0.01^2 = 0.0001$
Adh-1/Adh-4: $2(0.11)(0.04) = 0.0088$
Adh-2/Adh-4: $2(0.84)(0.04) = 0.0672$
Adh-3/Adh-4: $2(0.01)(0.04) = 0.0008$
Adh-4/Adh-4: $0.04^2 = 0.0016$

It should be pointed out that the observed genotype frequencies were nowhere near the Hardy-Weinberg expectations because Phlox cuspidata undergoes a substantial frequency of self-fertilization (about 78%), which violates the assumption of random mating. How to deal with such departures from random mating is discussed in Chapter 4.

X-Linked Genes

An important exception to the rule that diploid organisms contain two alleles of every gene applies to genes on the X and Y chromosomes. In mammals and many insects, females have two copies of the X chromosome whereas males have one X chromosome and one Y chromosome.
The X and Y chromosomes segregate, and so half the sperm from a male carry the X chromosome and half carry the Y chromosome. Although the Y chromosome carries very few genes other than those involved in the determination of sex and male fertility, the X chromosome carries as full a complement of genes as any other chromosome. Genes on the X chromosome are called X-linked genes, and the important consequence of X linkage is that a recessive allele on the X chromosome in a male is expressed phenotypically because the Y chromosome lacks any compensating allele. For X-linked genes with two alleles, therefore, there are three female genotypes (AA, Aa, and aa) but only two male genotypes (A and a).

The consequences of random mating with two X-linked alleles are shown in Figure 3.7, where the alleles are denoted $X^A$ and $X^a$. Note that in females, which have two X chromosomes, the genotype frequencies are as given by the Hardy-Weinberg principle in Equation 3.1; in males, which have only one X chromosome, the genotype frequencies are equal to the allele frequencies.

Figure 3.7 Consequences of random mating with X-linked genes. Genotype frequencies in females equal the Hardy-Weinberg frequencies, and genotype frequencies in males equal the allele frequencies. (Summed frequencies in zygotes. Females: $X^AX^A$, $p^2$; $X^AX^a$, $2pq$; $X^aX^a$, $q^2$. Males: $X^AY$, $p$; $X^aY$, $q$.)

The calculations in Figure 3.7 are valid only if the allele frequencies are identical in eggs and sperm.
When they differ, approximate equality of allele frequencies in the sexes is usually attained for X-linked genes in a period of 10 or so generations of random mating because, in each generation, any allele frequency in female zygotes is the average of the frequency of the allele in male and female parents in the previous generation.

PROBLEM 3.11 The human Xg blood group is controlled by an X-linked gene with two alleles, designated $Xg^a$ and $Xg$. Two phenotypes can be distinguished by means of the appropriate antisera, Xg(a+) and Xg(a-). $Xg^a$ is dominant to $Xg$, and so females of genotype $Xg^a/Xg^a$ and $Xg^a/Xg$ have blood type Xg(a+), whereas females of genotype $Xg/Xg$ are phenotypically Xg(a-). Males of genotype $Xg^a$ have blood type Xg(a+); those of genotype $Xg$ have blood type Xg(a-). In a sample of 2082 British people, there were 967 Xg(a+) females, 667 Xg(a+) males, 102 Xg(a-) females, and 346 Xg(a-) males (Race and Sanger 1975). The best estimates of allele frequency are p = 0.675 (for $Xg^a$) and q = 0.325 (for $Xg$). Calculate the expected numbers in the four phenotypic classes, assuming random-mating proportions, and carry out a $\chi^2$ test for goodness of fit. (The number of degrees of freedom in this case is 1: there are four degrees of freedom to start with; one must be deducted for using the observed number of males in calculating the expectations for males; one must be deducted for using the observed number of females in calculating their expectations; and one more must be deducted for estimating p from the data.)

ANSWER The expected numbers of Xg(a+) and Xg(a-) males are 0.675 x 1013 = 683.8 and 0.325 x 1013 = 329.2, respectively. The expected numbers of Xg(a+) and Xg(a-) females are $[0.675^2 + 2(0.675)(0.325)] \times 1069 = 956.1$ and $0.325^2 \times 1069 = 112.9$, respectively. The $\chi^2$ equals 2.45 which, as noted above, has one degree of freedom.
The associated probability is about 0.12 (Figure 3.3), and so there is no reason to reject the hypothesis of random-mating proportions.

One of the important features of random mating for X-linked genes is that phenotypes resulting from a recessive allele will be more common in males than in females. In Problem 3.11, for example, the proportion of Xg(a-) males is 346/1013 = 34%, whereas the proportion of Xg(a-) females is only 102/1069 = 10%. There is always an excess of affected males because q (which equals the proportion of males with the recessive phenotype) will always be greater than $q^2$ (which is the proportion of females with the recessive phenotype). Indeed, the discrepancy grows larger as the recessive allele becomes more rare. For example, with the X-linked "green" type of color blindness, q = 0.05 in Western Europeans, and so the ratio of affected males to affected females is $q/q^2 = 1/q = 1/0.05 = 20$. In contrast, for the X-linked "red" type of color blindness, q = 0.01 and so, in this case, the ratio of affected males to affected females is 1/0.01 = 100.

PROBLEM 3.12 California populations of Drosophila persimilis have two alleles of an X-linked gene coding for allozymes of phosphoglucomutase-1 (Policansky and Zouros 1977). The alleles may be designated $Pgm\text{-}1^A$ and $Pgm\text{-}1^B$; their estimated frequencies were 0.25 and 0.75, respectively. Assuming random-mating proportions, what are the expected genotype frequencies in males and females?

ANSWER In males, $Pgm\text{-}1^A$ at 0.25 and $Pgm\text{-}1^B$ at 0.75. In females, $Pgm\text{-}1^A/Pgm\text{-}1^A$ at $0.25^2 = 0.0625$; $Pgm\text{-}1^A/Pgm\text{-}1^B$ at $2(0.25)(0.75) = 0.3750$; $Pgm\text{-}1^B/Pgm\text{-}1^B$ at $0.75^2 = 0.5625$.

Before leaving the subject of X-linkage, it is necessary to point out that certain species (among them, birds, moths, and butterflies) have the sex-chromosome situation backwards. In these species, females are XY and males XX. The consequences of random mating are the same as otherwise, except that the sexes are reversed.
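The X-linked expectations described in this section can be sketched as follows. The function names and the dictionary layout are assumptions made for illustration; the sketch covers species in which males carry a single X chromosome and assumes equal allele frequencies in eggs and sperm.

```python
def x_linked_frequencies(q):
    """Expected genotype frequencies for an X-linked gene with two alleles,
    where q is the frequency of the recessive allele a (hypothetical helper)."""
    p = 1 - q
    females = {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}   # Hardy-Weinberg proportions
    males = {"A": p, "a": q}                                # equal to the allele frequencies
    return females, males


def affected_male_to_female_ratio(q):
    """Ratio of recessive-phenotype males (q) to females (q^2), which is 1/q."""
    return q / (q * q)


females, males = x_linked_frequencies(0.25)    # Pgm-1 allele frequencies from Problem 3.12
print(females, males)
print(affected_male_to_female_ratio(0.05))     # "green" color blindness
print(affected_male_to_female_ratio(0.01))     # "red" color blindness
```

For species in which females are the XY sex, the same functions apply with the two dictionaries exchanged.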
LINKAGE AND LINKAGE DISEQUILIBRIUM

With random mating, the alleles of any gene are combined at random into genotypes according to frequencies given by the Hardy-Weinberg proportions. To be specific, imagine a gene with two alleles, call them $A_1$ and $A_2$, at frequencies $p_1$ and $p_2$, respectively, where $p_1 + p_2 = 1$. Then the Hardy-Weinberg principle tells us that genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ are expected in the proportions $p_1^2$, $2p_1p_2$, and $p_2^2$, respectively, provided that mating is random. Similarly, we may consider a different gene with alleles $B_1$ and $B_2$ at frequencies $q_1$ and $q_2$, respectively, where $q_1 + q_2 = 1$. Then the Hardy-Weinberg principle tells us again that the genotype frequencies of $B_1B_1$, $B_1B_2$, and $B_2B_2$ are expected in the proportions $q_1^2$, $2q_1q_2$, and $q_2^2$, respectively, provided that mating is random. Thus, the $A_1$ allele is in random association with the $A_2$ allele, and the $B_1$ allele is in random association with the $B_2$ allele.

Strange as it may seem, the alleles of the A gene may nevertheless fail to be in random association with the alleles of the B gene. The precise meaning of "random association" is illustrated in Figure 3.8. In this figure the squares refer to the alleles present in gametes, not to genotypes as in earlier diagrams. When the alleles of the genes are in random association, the frequency of a gamete carrying any particular combination of alleles equals the product of the frequencies of those alleles.
Genes that are in random association are said to be in a state of linkage equilibrium, and genes not in random association are said to be in linkage disequilibrium. With linkage equilibrium, therefore, the gametic frequencies are:

$$A_1B_1:\ p_1 \times q_1 \qquad A_1B_2:\ p_1 \times q_2 \qquad A_2B_1:\ p_2 \times q_1 \qquad A_2B_2:\ p_2 \times q_2 \qquad (3.6)$$

Figure 3.8 Random association between two alleles of each of two genes, showing expected gametic frequencies when the alleles are in linkage equilibrium.

With random mating and the other simplifying assumptions listed earlier (including a large population with no mutation, migration, or selection), linkage equilibrium between genes is eventually attained. However, linkage equilibrium is attained gradually, and the rate of approach can be very slow. The slow approach to linkage equilibrium stands in contrast to the attainment of HWE with alleles of a single gene, which typically requires just one generation (when generations are nonoverlapping) or a relatively small number of generations (when generations are overlapping).

The rate of approach to linkage equilibrium depends on the rate of recombination in genotypes heterozygous for both genes. There are two types of double heterozygotes:

$$A_1B_1/A_2B_2 \qquad A_1B_2/A_2B_1$$

In the first case, the genotype was formed by the union of an $A_1B_1$ gamete with an $A_2B_2$ gamete. In the second case, the genotype was formed by the union of an $A_1B_2$ gamete with an $A_2B_1$ gamete. For the moment, consider the genotype $A_1B_1/A_2B_2$. The gametes produced by this genotype are of four types: (1) $A_1B_1$, (2) $A_2B_2$, (3) $A_1B_2$, and (4) $A_2B_1$. Gametic types 1 and 2 are known as nonrecombinant gametes because the alleles are associated in the same manner as in the previous generation (specifically, $A_1$ with $B_1$ and $A_2$ with $B_2$). Gametic types 3 and 4 are known as recombinant gametes because the alleles are associated differently than in the previous generation (specifically, $A_1$ with $B_2$ and $A_2$ with $B_1$).

Because of Mendelian segregation, the frequency of gametic type 1 equals that of type 2, and the frequency of gametic type 3 equals that of type 4. That is, the two nonrecombinant gametes are formed in equal frequencies, and the two recombinant gametes are formed in equal frequencies. However, the overall frequency of recombinant gametes (type 3 + type 4) does not necessarily equal the overall frequency of nonrecombinant gametes (type 1 + type 2) except in special cases. The term recombination fraction, usually symbolized r, refers to the proportion of recombinant gametes produced by a double heterozygote. Suppose, for example, that the genotype $A_1B_1/A_2B_2$ produces gametes $A_1B_1$, $A_2B_2$, $A_1B_2$, and $A_2B_1$ in the proportions 0.38, 0.38, 0.12, and 0.12, respectively. Then the recombination fraction between the genes is r = 0.12 + 0.12 = 0.24.

The recombination fraction between genes depends on whether they are present on the same chromosome and, if so, on the physical distance between them. For genes on different chromosomes, the recombination fraction is r = 0.5 because the four possible gametic types are produced in equal frequency. For genes on the same chromosome, the recombination fraction depends on their distance apart, because each chromosome aligns side-by-side with its partner chromosome in meiosis and can undergo a sort of breakage and reunion resulting in an exchange of parts between the partner chromosomes. The closer two genes are, the less likely that a breakage and reunion takes place in the region between the genes; the farther apart two genes are, the more likely such an event becomes. The smallest possible recombination fraction is r = 0, which would imply that the two genes are so close together that a break never takes place between them.
The largest possible recombination fraction is r = 0.5, which is found when genes are very far apart on the same chromosome or, as noted above, when they are on different chromosomes. Genes for which the recombination fraction is less than 0.5 must necessarily be on the same chromosome, and such genes are said to be linked. To sum up, if the recombination fraction between the A and B genes is denoted r, then the genotype $A_1B_1/A_2B_2$ produces the following types of gametes:

$A_1B_1$ with frequency $(1 - r)/2$
$A_2B_2$ with frequency $(1 - r)/2$
$A_1B_2$ with frequency $r/2$
$A_2B_1$ with frequency $r/2$

The situation in the $A_1B_2/A_2B_1$ genotype is much the same, but there is one important difference. In this case, the $A_1B_1$ and $A_2B_2$ gametes are the recombinant types, and the $A_1B_2$ and $A_2B_1$ gametes are the nonrecombinant types. Thus, the genotype $A_1B_2/A_2B_1$ produces the following types of gametes:

$A_1B_1$ with frequency $r/2$
$A_2B_2$ with frequency $r/2$
$A_1B_2$ with frequency $(1 - r)/2$
$A_2B_1$ with frequency $(1 - r)/2$

PROBLEM 3.13 The genes for the human MN and Ss blood groups discussed in Problem 3.4 are close together on the same chromosome. Suppose that the recombination fraction between the genes is r = 0.01. What types and frequencies of gametes would be produced by a person of genotype MS/Ns? By a person of genotype Ms/NS?

ANSWER The MS/Ns genotype produces gametic types MS, Ns, Ms, and NS in proportions (1 - 0.01)/2 = 0.495, (1 - 0.01)/2 = 0.495, 0.01/2 = 0.005, and 0.01/2 = 0.005, respectively. The Ms/NS genotype produces exactly the same gametic types, but their frequencies are 0.005, 0.005, 0.495, and 0.495, respectively.
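The gamete frequencies produced by either kind of double heterozygote can be sketched with one small function. The function name and the tuple representation of a chromosome are assumptions made for illustration; the sketch assumes the individual is heterozygous at both genes, since otherwise recombinant and nonrecombinant gametes are indistinguishable.

```python
def gamete_frequencies(chromosome1, chromosome2, r):
    """Gametes produced by a double heterozygote with recombination fraction r.

    Each chromosome is an (A-gene allele, B-gene allele) pair of strings.
    Nonrecombinant types get (1 - r)/2 each, recombinant types r/2 each.
    Assumes both genes are heterozygous (hypothetical helper).
    """
    a1, b1 = chromosome1
    a2, b2 = chromosome2
    return {
        a1 + b1: (1 - r) / 2,   # nonrecombinant
        a2 + b2: (1 - r) / 2,   # nonrecombinant
        a1 + b2: r / 2,         # recombinant
        a2 + b1: r / 2,         # recombinant
    }


# The two genotypes of Problem 3.13 with r = 0.01
ms_ns = gamete_frequencies(("M", "S"), ("N", "s"), r=0.01)   # genotype MS/Ns
ms_NS = gamete_frequencies(("M", "s"), ("N", "S"), r=0.01)   # genotype Ms/NS
print(ms_ns)
print(ms_NS)
```

The same four gametic types appear in both dictionaries; only the assignment of the large and small frequencies changes with the arrangement of alleles in the parent.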
The recombination fraction between genes is important in population genetics because it governs the rate of approach to linkage equilibrium. To be precise, consider a population in which the actual frequencies of the chromosome types among gametes are as follows:

$$A_1B_1:\ P_{11} \qquad A_1B_2:\ P_{12} \qquad A_2B_1:\ P_{21} \qquad A_2B_2:\ P_{22}$$

where $P_{11} + P_{12} + P_{21} + P_{22} = 1$. In terms of the gametic frequencies, linkage equilibrium is defined as the state in which $P_{11} = p_1q_1$, $P_{12} = p_1q_2$, $P_{21} = p_2q_1$, and $P_{22} = p_2q_2$ (see Figure 3.8).

Suppose that the genes are not in linkage equilibrium. To determine how rapidly linkage equilibrium is approached, we need to deduce the gametic frequencies in the next generation. Consider first the $A_1B_1$ gamete. In any one generation, a chromosome carrying $A_1B_1$ either could have undergone recombination between the genes (an event with probability r, where r is the recombination fraction), or could have escaped recombination between the genes (an event with probability 1 - r). Among the $A_1B_1$ chromosomes that did not undergo recombination, the frequency of $A_1B_1$ is the same as it was in the previous generation; among the chromosomes that did undergo recombination, the frequency of $A_1B_1$ chromosomes is simply the frequency of $-B_1/A_1-$ genotypes in the previous generation, where the dash in place of the A and B allele means that the identity of that particular allele is irrelevant. Because mating is random, the overall frequency of $-B_1/A_1-$ genotypes is $p_1q_1$. Putting all the steps in the argument together, the frequency of $A_1B_1$ in any generation, call it $P_{11}'$, is related to the frequency $P_{11}$ in the previous generation by the equation

$$P_{11}' = (1 - r) \times P_{11}\ \text{[for the nonrecombinants]} + r \times p_1q_1\ \text{[for the recombinants]}$$

Subtraction of $p_1q_1$ from both sides leads to

$$P_{11}' - p_1q_1 = (1 - r)(P_{11} - p_1q_1) \qquad (3.7)$$

Equation 3.7 becomes simplified somewhat by defining D as the difference $P_{11} - p_1q_1$.
Then $D_n$ is the value of D in the nth generation, and Equation 3.7 implies that $D_n = (1 - r)D_{n-1}$. The solution of this equation is found by successive substitution as

$$D_n = (1 - r)D_{n-1} = (1 - r)^2 D_{n-2} = \cdots = (1 - r)^n D_0 \tag{3.8}$$

where $D_0$ is the value of D in the founding population. Because 1 - r < 1, $(1 - r)^n$ goes to zero as n becomes large, but how rapidly $(1 - r)^n$ goes to zero depends on r; the closer r is to zero, the slower the rate. This principle is illustrated in Figure 3.9. Recall here that r = 0.5 corresponds either to genes far apart in the same chromosome or to genes in different chromosomes. Because $(1 - r)^n$ goes to zero, D goes to zero, and therefore $P_{11}$ goes to $p_1q_1$ unless there are other offsetting processes. Analogous arguments hold for gametes containing $A_1B_2$, $A_2B_1$, or $A_2B_2$, and so $P_{12}$, $P_{21}$, and $P_{22}$ go to $p_1q_2$, $p_2q_1$, and $p_2q_2$, respectively. Thus, linkage equilibrium is attained at a rate determined by the value of r.

Figure 3.9 (labels: frequency of recombination, r; time in generations) Linkage disequilibrium between genes gradually disappears when mating is random, provided there is no countervailing force building it up. The rate of approach to linkage equilibrium depends on the recombination frequency between the genes. The disappearance of linkage disequilibrium is gradual even with free recombination (r = 1/2). In these examples, the frequencies of both alleles at both loci equal 1/2, and the initial linkage disequilibrium is either at its maximum (D = 0.25) or minimum (D = -0.25) value, given these allele frequencies.

The value of D that holds for $P_{11} - p_1q_1$ also holds for the other possible gametes, as follows:

$$P_{11} = p_1q_1 + D \qquad P_{12} = p_1q_2 - D \qquad P_{21} = p_2q_1 - D \qquad P_{22} = p_2q_2 + D$$

The quantity D is often called the linkage disequilibrium parameter.
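Equation 3.8 is easy to explore numerically. The sketch below is illustrative only: the function names are assumptions introduced here, and the second function simply solves $(1 - r)^n \le$ fraction for the smallest whole number of generations n.

```python
import math


def disequilibrium_after(d0, r, n):
    """Equation 3.8: D_n = (1 - r)**n * D_0 under random mating."""
    return (1 - r) ** n * d0


def generations_to_fraction(r, fraction):
    """Smallest n for which (1 - r)**n <= fraction (0 < r <= 0.5, 0 < fraction < 1)."""
    if not 0.0 < r <= 0.5 or not 0.0 < fraction < 1.0:
        raise ValueError("need 0 < r <= 0.5 and 0 < fraction < 1")
    return math.ceil(math.log(fraction) / math.log(1 - r))


# Starting from the maximum D = 0.25 used in Figure 3.9:
d_unlinked = disequilibrium_after(0.25, 0.5, 1)   # halves in one generation
n_unlinked = generations_to_fraction(0.5, 0.01)   # generations until D <= 1% of D_0
n_tight = generations_to_fraction(0.01, 0.01)     # same target with r = 0.01
```

With free recombination D falls below 1% of its starting value within 7 generations, whereas with r = 0.01 the same decay takes several hundred generations, which is the contrast Figure 3.9 draws.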
In terms of the gametic frequencies, D can be shown to satisfy

$$D = P_{11}P_{22} - P_{12}P_{21} \tag{3.9}$$

With random mating and no countervailing forces, the value of D changes according to Equation 3.8, and D = 0 corresponds to linkage equilibrium. Furthermore, $P_{11}$, $P_{12}$, $P_{21}$, and $P_{22}$ must all be nonnegative and so, for any prescribed allele frequencies $p_1$, $p_2$, $q_1$, and $q_2$, the smallest possible ($D_{min}$) and largest possible ($D_{max}$) values of D are as follows:

$$D_{min} = \text{the larger of } -p_1q_1 \text{ and } -p_2q_2 \qquad D_{max} = \text{the smaller of } p_1q_2 \text{ and } p_2q_1 \tag{3.10}$$

In studies of linkage disequilibrium, estimation of the gametic frequencies $P_{11}$, $P_{12}$, $P_{21}$, and $P_{22}$ usually requires complex statistical procedures rather than straightforward chromosome-counting methods because there are 10 genotypes but usually no more than nine phenotypes. (There are 10 genotypes because $A_1B_1/A_2B_2$ and $A_1B_2/A_2B_1$ must be distinguished.)

An example of linkage disequilibrium is found in the genes controlling the MN and Ss blood groups in human populations. Earlier in this chapter, we cited data from 1000 Britishers with respect to the MN blood groups and showed that the genotypes MM, MN, and NN are in Hardy-Weinberg proportions. In Problem 3.4, data from the same 1000 people were analyzed with respect to the Ss blood groups, and genotypes SS, Ss, and ss were also found to satisfy the Hardy-Weinberg proportions. In order to discuss linkage disequilibrium between the genes, it will be convenient to use the symbols $p_1$ and $p_2$ for the allele frequencies of M and N, respectively, and the symbols $q_1$ and $q_2$ for the allele frequencies of S and s, respectively. The earlier analyses yielded estimates of $p_1$ = 0.5425 and $p_2$ = 0.4575 for M and N and $q_1$ = 0.3080 and $q_2$ = 0.6920 for S and s. Were the loci in linkage equilibrium, the gametic frequencies would be $p_1q_1$ for MS, $p_1q_2$ for Ms, $p_2q_1$ for NS, and $p_2q_2$ for Ns. Therefore, among the 1000 genotypes (a total of 2000 chromosomes), the expected numbers are as shown in the third column below (the second column gives the observed numbers):

| Gamete | Observed | Expected |
|---|---|---|
| MS | 474 | 0.5425 × 0.3080 × 2000 = 334.2 |
| Ms | 611 | 0.5425 × 0.6920 × 2000 = 750.8 |
| NS | 142 | 0.4575 × 0.3080 × 2000 = 281.8 |
| Ns | 773 | 0.4575 × 0.6920 × 2000 = 633.2 |

The $\chi^2$ for goodness of fit is 184.7 with one degree of freedom: 4 - 1 (to start with) - 1 (for estimating $p_1$ from the data) - 1 (for estimating $q_1$ from the data) = 1. The associated probability is so small as to be off the chart in Figure 3.3, and consequently it is very much less than 0.0001. This result means that chance alone would produce a fit as poor or poorer substantially less than one time in 10,000, and so the hypothesis that the loci are in linkage equilibrium can confidently be rejected.

To quantify the amount of linkage disequilibrium, we must estimate the gametic frequencies $P_{11}$, $P_{12}$, $P_{21}$, and $P_{22}$:

• MS: $P_{11}$ = 474/2000 = 0.2370
• Ms: $P_{12}$ = 611/2000 = 0.3055
• NS: $P_{21}$ = 142/2000 = 0.0710
• Ns: $P_{22}$ = 773/2000 = 0.3865

Thus, D can be estimated as $D = P_{11}P_{22} - P_{12}P_{21}$ = 0.07. From Equation 3.10, $D_{max}$ is given by $p_1q_2$ or $p_2q_1$, whichever is smaller; in this case $p_1q_2$ = 0.38 and $p_2q_1$ = 0.14, hence $D_{max}$ = 0.14. Therefore, $D/D_{max}$ = 0.07/0.14 = 50%, and so we conclude that the amount of disequilibrium between the genes controlling the MN and Ss blood groups is about 50% of its theoretical maximum.

In most local populations of sexual organisms that regularly avoid extreme inbreeding (mating between relatives), values of D are typically zero or close to zero (indicating linkage equilibrium) unless the genes are very closely linked. This overall conclusion is exemplified in the following problems.

PROBLEM 3.14 In Drosophila melanogaster, the genes E6-EC-Odh are linked in chromosome 3. The E6 and EC genes are rather loosely linked (r = 0.122), whereas EC and Odh are tightly linked (r = 0.002). The recombination fractions are those in females, as recombination does not take place in males of this species. Using the data from the experimental population given in Problem 2.3 (page 47), carry out an analysis to determine whether there is linkage disequilibrium between E6 and EC. If there is linkage disequilibrium, what is its magnitude relative to the theoretical maximum (or minimum) value?

ANSWER For the data given in Problem 2.3, the observed numbers of the four chromosomal types $E6^F\,EC^F$, $E6^F\,EC^S$, $E6^S\,EC^F$, and $E6^S\,EC^S$ were 159, 16, 277, and 37, respectively. The estimated allele frequencies of $E6^F$, $E6^S$, $EC^F$, and $EC^S$ are 0.3579, 0.6421, 0.8916, and 0.1084, respectively. Assuming linkage equilibrium, the expected numbers of the four chromosomal types are 156.0, 19.0, 280.0, and 34.0, respectively. The $\chi^2$ value with one degree of freedom is 0.828, for which the associated probability is about 0.4. Thus, there is no reason to reject the hypothesis that E6 and EC are in linkage equilibrium in this experimental population.

PROBLEM 3.15 Carry out an analysis of linkage disequilibrium for the genes EC and Odh, using the data in Problem 2.3 (page 47). A convenient shortcut to obtaining the $\chi^2$ value is first to calculate

$$\rho = \frac{D}{\sqrt{p_1 p_2 q_1 q_2}}$$

by substituting, on the right-hand side, the estimated values for each of the parameters. The value of $\chi^2$ is numerically equal to $\rho^2 N$, where N is the total number of chromosomes examined. The biological meaning of $\rho$ is that it is the correlation between alleles present in the same chromosome.

ANSWER For the data given in Problem 2.3, the observed numbers of the chromosomal types $EC^F\,Odh^F$, $EC^F\,Odh^S$, $EC^S\,Odh^F$, and $EC^S\,Odh^S$ were 416, 20, 44, and 9, respectively.
The estimated allele frequencies of $EC^F$, $EC^S$, $Odh^F$, and $Odh^S$ are 0.8916, 0.1084, 0.9407, and 0.0593, respectively, and D = (416 × 9 - 20 × 44)/489² = 0.0120. Thus, $\rho$ = 0.0120/(0.8916 × 0.1084 × 0.9407 × 0.0593)$^{1/2}$ = 0.1631. Conse-

approximately 1 - 2q, 2q, and 0 when q is so small that $q^2$ is approximately 0.

8. In a population undergoing random mating for a single gene with a dominant and recessive allele, show that the allele frequency of the recessive allele among individuals with the dominant phenotype is q/(1 + q), where q is the allele frequency of the recessive in the whole population.

9. The frequency of one form of recessive X-linked color blindness is 5% among European males. What is the expected frequency of this form of color blindness among females? What fraction of females would be heterozygous carriers?

10. For a trait due to a rare X-linked recessive gene, show that the frequency of carrier females is approximately equal to two times the frequency of affected males.

11. What is the analogue of the Hardy-Weinberg principle for a gene with two alleles in a tetraploid?

12. Given the following table of allele frequencies:

| Gene | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Allele 1 | 0.63 | 0.94 | 0.995 | 1.0 | 0.78 |
| Allele 2 | 0.37 | 0.06 | 0.005 | - | 0.12 |
| Allele 3 | - | - | - | - | 0.06 |
| Allele 4 | - | - | - | - | 0.04 |

What is the proportion (P) of polymorphic genes (using the definition in the text)? Assuming random mating and linkage equilibrium, what is the average heterozygosity (H) for the set of genes?

13. Charles Darwin could have discovered segregation had he known what to look for, as Mendelian segregation occurred in at least one of his own experiments. Darwin (cited in Iltis 1932) studied flower shape in the snapdragon Antirrhinum. In a cross between a true-breeding strain with regular (peloric) flowers and a true-breeding strain with irregular (normal) flowers, all of the $F_1$'s were normal. Crosses of $F_1 \times F_1$ yielded 88 normal and 37 peloric plants.
Perform a $\chi^2$ test assuming a 3 : 1 ratio in the $F_2$. Is the peloric or normal allele dominant?

14. For a mating between triple dominant/recessive heterozygotes of three unlinked genes, there are eight phenotypic classes among the offspring. What are the expected phenotypic ratios? Mendel carried out such an experiment and obtained the phenotypic ratio 269 : 98 : 86 : 88 : 30 : 34 : 27 : 7 among a total of 639 progeny. (He complained that this experiment required the most time and effort of any of his crosses.) Calculate the $\chi^2$ and associated probability.

15. If one gene has alleles $A_1$ and $A_2$ at frequencies $p_1$ and $p_2$, and another gene has alleles $B_1$, $B_2$, and $B_3$ at frequencies $q_1$, $q_2$, and $q_3$, what are the expected frequencies of gametes with linkage equilibrium, assuming that $p_1$ = 0.3, $q_1$ = 0.2, and $q_2$ = 0.3?

16. For two genes with alleles $A_1$ and $A_2$ and $B_1$ and $B_2$, respectively, with $p_1$ [...]:
a. What are the frequencies of all possible gametes assuming linkage equilibrium?
b. What are the frequencies of all possible gametes if there is linkage disequilibrium with D equal to 50% of its theoretical maximum?

17. Use the result in Problem 8 to show that the frequency of homozygous recessive genotypes from dominant × dominant matings is $[q/(1 + q)]^2$ and from dominant × recessive matings is q/(1 + q). Note that the latter is equal to the square root of the former. (These proportions are called Snyder's ratios and were once used to test traits for simple recessive inheritance.)

mixture, but there is substantial linkage disequilibrium between the alleles, as shown at the bottom of the table. In the mixed population, D equals 81% of its theoretical maximum value. The sole cause of the disequilibrium is the differing allele frequencies in the subpopulations.
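The admixture effect just described for Table 3.2 can be reproduced numerically. The following is a minimal sketch using the subpopulation allele frequencies given for Table 3.2 (0.05 and 0.95 for both $A_1$ and $B_1$); the function names and the tuple ordering ($P_{11}$, $P_{12}$, $P_{21}$, $P_{22}$) are conventions assumed here for illustration.

```python
def equilibrium_gametes(p1, q1):
    """Gametic frequencies (P11, P12, P21, P22) at linkage equilibrium."""
    p2, q2 = 1 - p1, 1 - q1
    return (p1 * q1, p1 * q2, p2 * q1, p2 * q2)


def mix(gametes_a, gametes_b, weight_a=0.5):
    """Gametic frequencies in a mixture of two subpopulations."""
    return tuple(weight_a * a + (1 - weight_a) * b
                 for a, b in zip(gametes_a, gametes_b))


def disequilibrium(gametes):
    """Equation 3.9: D = P11*P22 - P12*P21."""
    p11, p12, p21, p22 = gametes
    return p11 * p22 - p12 * p21


def d_max(gametes):
    """Equation 3.10: the smaller of p1*q2 and p2*q1."""
    p11, p12, p21, p22 = gametes
    p1, q1 = p11 + p12, p11 + p21
    return min(p1 * (1 - q1), (1 - p1) * q1)


sub1 = equilibrium_gametes(0.05, 0.05)  # subpopulation 1, D = 0
sub2 = equilibrium_gametes(0.95, 0.95)  # subpopulation 2, D = 0
mixed = mix(sub1, sub2)                 # equal mixture of the two

d_mixed = disequilibrium(mixed)
relative_d = d_mixed / d_max(mixed)
```

Each subpopulation has D = 0 on its own, yet the equal mixture has D of about 0.2025 against a maximum of 0.25, that is, about 81% of the maximum. Nothing in the calculation refers to the recombination fraction, which is why the effect does not depend on the genes being linked.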
Furthermore, the considerations in Table 3.2 make no assumption that A and B are on the same chromosome, hence linkage disequilibrium may result from population admixture even for genes on different chromosomes. If subpopulations become permanently mixed and undergo random mating, then Equation 3.8 implies that the induced linkage disequilibrium is expected to decrease at the rate r per generation, where r is the recombination fraction between the A and B genes. For unlinked genes, r = 1/2.

SUMMARY

In any population, the genotype frequencies among zygotes are determined in large part by the patterns in which genotypes of the previous generation come together to form mating pairs. In random mating, genotypes form mating pairs in the proportions expected from random collisions. For a gene with two alleles A and a in a random-mating population, the expected genotype frequencies of AA, Aa, and aa are given by $p^2$, 2pq, and $q^2$, respectively, where p and q are the allele frequencies of A and a, respectively, with p + q = 1. The expected genotype frequencies with random mating constitute the Hardy-Weinberg equilibrium (HWE). The rate at which the HWE frequencies are attained depends on the life history of the organism. In an organism with nonoverlapping generations, such as an annual plant, each generation is separated in time from the preceding and the following generation; in this case, the Hardy-Weinberg frequencies are attained in one generation of random mating provided that the allele frequencies are equal in the sexes. In an organism with overlapping generations, the approach to HWE is gradual. Statistical tests of HWE are often based on the $\chi^2$ test, but this test is relatively weak in detecting departures from the expected frequencies, especially those caused by admixture of subpopulations differing in allele frequency.
One of the principal implications of the HWE is that the allele frequencies and the genotype frequencies remain constant from generation to generation, hence genetic variation is maintained. Another major implication is that, when an allele is rare, the population contains many more heterozygotes for the allele than it contains homozygotes for the allele. Extensions of the HWE include multiple alleles and X-linked genes. With multiple alleles, the expected frequency of a homozygous genotype $A_iA_i$ equals $p_i^2$, and the expected frequency of a heterozygous genotype $A_iA_j$ equals $2p_ip_j$, where $p_i$ and $p_j$ are the allele frequencies of $A_i$ and $A_j$. With X-linked alleles, the genotype frequencies in females (XX) are given by the HWE but those in males (XY) are given by the allele frequencies. Consequently, for a recessive X-linked mutation with allele frequency q, the proportion of affected males (q) always exceeds the proportion of affected females ($q^2$); the rarer the recessive allele, the greater is the excess of affected males.

Nonrandom association between the alleles of different genes is measured by the linkage disequilibrium parameter D. Random association between alleles of different genes is called linkage equilibrium, and it is indicated by D = 0. When D ≠ 0, the alleles are said to be in linkage disequilibrium. Ordinarily, unless there is some countervailing process that maintains linkage disequilibrium between two genes, D is expected to go to zero at a rate determined by the recombination fraction between the genes. For unlinked genes, D decreases by one-half in each generation; for genes that recombine with a frequency r, D decreases by the fraction r in each generation. Significant linkage disequilibrium is usually found in natural populations for genes that are tightly linked, for genes that are within or near an inverted segment of chromosome, or for genes in plant species that regularly undergo self-fertilization.
Significant linkage disequilibrium can also result from admixture of two or more subpopulations differing in allele frequencies.

PROBLEMS

1. Phenylketonuria is an autosomal recessive form of severe mental retardation. About one in 10,000 newborn Caucasians are affected. Assuming random mating, what is the frequency of heterozygous carriers?

2. Mourant et al. (1976) cite data on 400 Basques from Spain, of which 230 were $Rh^+$ and 170 were $Rh^-$. Estimate the allele frequencies of D and d. How many of the $Rh^+$ individuals are expected to be heterozygous?

3. Kelus (cited in Mourant et al. 1976) reports a study of 3100 Poles, of whom 1101 were MM, 1496 were MN, and 503 were NN. Calculate the allele frequencies and the expected numbers of the three genotypes and carry out a $\chi^2$ test for goodness of fit to random-mating proportions.

4. Consider an autosomal gene with four alleles $A_1$, $A_2$, $A_3$, and $A_4$ with respective frequencies 0.1, 0.2, 0.3, and 0.4. Calculate the expected genotype frequencies under random mating.

5. Show that the proportion of heterozygous offspring from a heterozygous parent is 1/2 in a population undergoing random mating for a single gene with two alleles.

6. If random mating with two alleles gives frequencies D, H, and R for homozygous dominant, heterozygote, and homozygous recessive, show that $DR = \tfrac{1}{4}H^2$.

7. When mating is random for a gene with two alleles A and a at frequencies p and q, show that the genotype frequencies of AA, Aa, and aa are

the population structure of a widespread species of freshwater fish. The lowest population level consists of a local interbreeding population of animals within a stream. A stream may contain more than one such local population. The next-higher level in the hierarchy may be the organization of streams into groups feeding the same river. Another higher level may be rivers within watersheds. An even higher level of organization may be watersheds within continents.
The aggregation of subpopulations into progressively more inclusive groups may continue for as many levels as is convenient and informative. It is inevitably somewhat arbitrary how the groups at each level are combined to form the next higher level in the hierarchy. The objective of the classification is informativeness: one tries to group the subpopulations in such a way as to highlight the genetic similarities and differences among them.

If there were so much migration of fish among subpopulations that all members of the species constituted essentially a single, random-mating population, then there would be no need to define a hierarchical population structure because it would be uninformative. However, most organisms do have significant population substructure.

Reduction in Heterozygosity

One of the important consequences of population substructure is a reduction in the average proportion of heterozygous genotypes relative to that expected under random mating. The reason for the reduction in heterozygosity may be understood by considering the hypothetical example in Figure 4.1. The outline is the floor plan of a large barn. The organisms of interest are the mice concentrated primarily into two subpopulations of equal size at the west and east ends of the barn. The movement of mice between the subpopulations is prevented by a large population of hungry and vigilant cats in the central area. The occasional mouse that comes out of its refuge is quickly eaten. (These hypothetical mice have not been endowed with the ingenuity to find alternative routes between the west and east ends of the barn, like sneaking along the rafters.) Because of chance effects in the founding of the subpopulations, the west and east subpopulations are completely homozygous for alternative alleles of a gene. All the mice in the west subpopulation are AA, and all those in the east subpopulation are aa.
In technical terms, the west subpopulation is fixed for the A allele (its allele frequency equals 1), and the east subpopulation is fixed for the a allele. The genotype frequencies of AA, Aa, and aa in the west subpopulation are 1, 0, and 0, respectively, and those in the east subpopulation are 0, 0, and 1, respectively. Within each subpopulation there is random mating, and the genotype frequencies, though extreme, still satisfy the Hardy-Weinberg principle. In particular, the frequencies of AA, Aa, and aa within each subpopulation are given by $p^2$, 2pq, and $q^2$, where p = 0 in the east subpopulation, and p = 1 in the west subpopulation. Therefore, within any one of the subpopulations in Figure 4.1, the frequency of heterozygotes equals the frequency expected with HWE.

Figure 4.1 An extreme example of the general principle that a difference in allele frequency among subpopulations results in a deficiency of heterozygotes. The floor plan is that of a hypothetical barn. The mouse subpopulations in the east and west enclaves are completely isolated owing to the cats in the middle. The west subpopulation is fixed for the A allele and the east subpopulation for the a allele. Trapping mice at random in the area patrolled by the cats would yield an overall allele frequency of 1/2 but no heterozygotes.

The situation regarding the total population in Figure 4.1 is very different, however, as there is an overall deficiency of heterozygotes. By "total population" in this context, we mean the aggregate of all mice without regard to the population substructure. Suppose we were unaware of the population substructure in the barn. We might then suppose that the barn contained a single randomly mating population. To study the total population of the barn, we trap mice at random in the center area, catching the occasional escapee from the cats.
Because the subpopulations are fixed for either A or a, half the time we would trap an AA homozygote and half the time an aa homozygote. Consequently, we estimate the allele frequency of A as p = 1/2. Assuming random mating and Hardy-Weinberg genotype frequencies in the total population, the expected genotype frequencies of AA, Aa, and aa are given by the HWE as $p^2$, 2pq, and $q^2$. Because the overall allele frequency of A among the trapped animals is 1/2, we would naively expect a fraction 2 × 1/2 × 1/2 = 1/2 of the animals to be heterozygous. In fact, we would have caught no heterozygotes at all!

CHAPTER 4

Population Substructure

Hierarchical Structure F Statistics, Wahlund Effect DNA Typing Assortative Mating Inbreeding Inbreeding Coefficient

Population substructure is almost universal among organisms. Many organisms naturally form subpopulations in the form of herds, flocks, schools, colonies, or other types of aggregations. In addition, natural habitats are typically patchy, with favorable areas intermixed with unfavorable areas. Through time, even uniformly favorable areas can be disrupted by floods, fires, or other perils. When there is population subdivision, there is almost inevitably some genetic differentiation among the subpopulations. By genetic differentiation we mean the acquisition of allele frequencies that differ among the subpopulations. Genetic differentiation may result from natural selection favoring different genotypes in different subpopulations, but it may also result from random processes in the transmission of alleles from one generation to the next or from chance differences in allele frequency among the initial founders of the subpopulations. This chapter considers some of the consequences of population subdivision as well as other types of nonrandom mating.
HIERARCHICAL POPULATION STRUCTURE

A population is said to have a hierarchical population structure if the subpopulations can be grouped into progressively inclusive levels in which, at each grouping, the next lower levels are included ("nested") within the next higher ones. To consider a concrete example, imagine we were interested in

quently, $\chi^2$ = 0.1631² × 489 = 13.0 with one degree of freedom, for which the associated probability is 0.0004. Thus, there is significant linkage disequilibrium between these genes. The value of $D_{max}$ is the smaller of 0.053 and 0.102, and so $D_{max}$ = 0.053. The magnitude of the linkage disequilibrium, relative to its theoretical maximum, is 0.012/0.053 = 22.6%. The $\chi^2$ can also be calculated from the expected numbers of the four gametic types, which are 410.1, 25.9, 49.9, and 3.1, respectively.

PROBLEM 3.16 Use the formula for $\chi^2$ in Problem 3.15 to evaluate the statistical significance of the linkage disequilibrium between alleles of the gene for alcohol dehydrogenase in Drosophila melanogaster and the presence or absence of an EcoRI restriction site located 3500 nucleotides downstream. The data are from a population descended from animals trapped at a Dutch fruit market in Groningen (Cross and Birley 1986).

• $Adh^F$, EcoRI site present: 22
• $Adh^F$, EcoRI site absent: 3
• $Adh^S$, EcoRI site present: 4
• $Adh^S$, EcoRI site absent: 5

ANSWER D = 0.085 and $\chi^2 = \rho^2 N$ = 0.453² × 34 = 7.0 with one degree of freedom; the associated probability value is approximately 0.01. The linkage disequilibrium is statistically significant and has a value of 49% of its maximum possible value.

Linkage disequilibrium in local populations, such as seen in the preceding examples, can be caused by linkage disequilibrium in the founding population that has not yet had time to dissipate due to the small value of r.
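The calculations in Problems 3.15 and 3.16 follow the same recipe, so they can be collected into one routine. The sketch below is illustrative: the function name `ld_from_counts` and the returned dictionary keys are assumptions introduced here, and the routine simply applies Equations 3.9 and 3.10 together with the $\chi^2 = \rho^2 N$ shortcut to four observed chromosome counts.

```python
import math


def ld_from_counts(n11, n12, n21, n22):
    """Linkage disequilibrium statistics from counts of the four chromosome types.

    n11, n12, n21, n22 -- observed numbers of A1B1, A1B2, A2B1, A2B2 chromosomes.
    Returns D, rho, chi-square (= rho**2 * N, one degree of freedom), and D as a
    fraction of its bound (D_max if D >= 0, D_min otherwise).
    """
    n = n11 + n12 + n21 + n22
    p11, p12, p21, p22 = (x / n for x in (n11, n12, n21, n22))
    p1, q1 = p11 + p12, p11 + p21           # allele frequencies of A1 and B1
    p2, q2 = 1 - p1, 1 - q1
    if min(p1, p2, q1, q2) <= 0:
        raise ValueError("both genes must be polymorphic in the sample")
    d = p11 * p22 - p12 * p21               # Equation 3.9
    rho = d / math.sqrt(p1 * p2 * q1 * q2)  # correlation between alleles
    if d >= 0:
        bound = min(p1 * q2, p2 * q1)       # D_max, Equation 3.10
    else:
        bound = max(-p1 * q1, -p2 * q2)     # D_min, Equation 3.10
    return {"D": d, "rho": rho, "chi2": rho ** 2 * n, "relative_D": d / bound}


adh = ld_from_counts(22, 3, 4, 5)         # Problem 3.16 (Adh and the EcoRI site)
ec_odh = ld_from_counts(416, 20, 44, 9)   # Problem 3.15 (EC and Odh)
```

For the Adh data this gives D of about 0.085, $\rho$ of about 0.453, $\chi^2$ of about 7.0, and D at about 49% of its maximum, in agreement with the answer above; for EC and Odh it gives $\chi^2$ of about 13.0 and about 22.6% of the maximum.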
Another possible cause of linkage disequilibrium is admixture of populations with differing gametic frequencies. A third possibility is natural selection differentially favoring some genotypes over others to such an extent that it overcomes the natural tendency for D to go to zero.

Several examples in which linkage disequilibrium typically is present in natural populations should be mentioned here. One case concerns plants that ordinarily undergo self-fertilization, and examples are discussed in Chapter 4 in connection with the discussion of inbreeding. Another case involves certain inversions that are polymorphic in populations of certain species of Drosophila, most notably D. pseudoobscura and D. subobscura and their relatives. A chromosome with an inversion, as the name implies, has a certain segment of its genes in reverse of the normal order. Because of the inverted segment, the process of chromosome breakage and reunion in meiosis cannot be completed in the normal manner, with the result that the alleles in the inverted segment are usually unaffected by recombination and so they remain linked together. Because inversions prevent recombination, each inversion represents a sort of "supergene," and natural selection accumulates beneficially interacting alleles within each inversion. The beneficially interacting alleles are said to show genetic coadaptation.

Linkage disequilibrium can also arise as an artifact of admixture of subpopulations that differ in allele frequencies. Organisms that are subdivided into local populations are said to have population substructure. An example of linkage disequilibrium arising from subpopulation admixture is illustrated in Table 3.2. In this example, subpopulation 1 and subpopulation 2 are both in linkage equilibrium for the alleles of the A and B genes.
Subpopulation 1 has an allele frequency of 0.05 for both $A_1$ and $B_1$, and subpopulation 2 has an allele frequency of 0.95 for both $A_1$ and $B_1$. An equal mixture of organisms from both subpopulations has the gametic frequencies shown in the last column of Table 3.2. The allele frequencies of $A_1$ and $B_1$ are both 0.50 in the

TABLE 3.2 LINKAGE DISEQUILIBRIUM FROM ADMIXTURE OF SUBPOPULATIONS

| Chromosome | Frequency | Subpopulation 1 | Subpopulation 2 | Equal mixture |
|---|---|---|---|---|
| $A_1B_1$ | $P_{11}$ | 0.0025 | 0.9025 | 0.4525 |
| $A_1B_2$ | $P_{12}$ | 0.0475 | 0.0475 | 0.0475 |
| $A_2B_1$ | $P_{21}$ | 0.0475 | 0.0475 | 0.0475 |
| $A_2B_2$ | $P_{22}$ | 0.9025 | 0.0025 | 0.4525 |
| $D = P_{11}P_{22} - P_{12}P_{21}$ | | 0 | 0 | 0.2025 |
| $D_{min}$ | | -0.0025 | -0.0025 | -0.2500 |
| $D_{max}$ | | 0.0475 | 0.0475 | 0.2500 |

This rather paradoxical result (that there is a deficiency of heterozygotes in the total population even though random mating takes place within each subpopulation) is a consequence of the difference in allele frequency among the subpopulations. Were the allele frequencies in both subpopulations the same, it would not matter whether we sampled from the west subpopulation, the east subpopulation, or from the area in between. We would recover genotypes in Hardy-Weinberg proportions because both subpopulations are genotypically identical and in HWE. In an organism with hierarchically structured subpopulations, there is an analogous deficiency of heterozygotes at each level in the hierarchy. The following section examines the heterozygosities in more detail.

Average Heterozygosity

In the Mohave desert, local populations of the annual plant Linanthus parryae are polymorphic for white versus blue flowers. The plant is diminutive, averaging just 1 cm in height, and when the plant is in bloom, the ground cover of white flowers justifies the popular name "desert snow." Blue flowers result from homozygosity for a recessive allele. The geographical distribution of the frequency q of the recessive allele across a region of the Mohave desert is illustrated in Figure 4.2.
Each allele frequency is based on an examination of approximately 4000 plants over an area of about 30 square miles (Epling and Dobzhansky 1942).

Judging from the allele-frequency map in Figure 4.2, the highest frequencies of the blue-flower allele are largely concentrated at the west and east ends of the region in question. The unequal allele frequencies across the range imply a decrease in average heterozygosity, relative to HWE, analogous to the mouse example in Figure 4.1, though not as extreme.

Figure 4.2 (map labels: West, Central, East; scale bar: 10 miles) Estimated frequency of a recessive allele for blue flower color in populations of Linanthus parryae in an area of approximately 900 square miles in the Mohave desert. Each allele frequency is based on an examination of approximately 4000 plants over an area of about 30 square miles. (After Wright 1943a.)

Figure 4.2 shows the estimated allele frequency in each of 30 subpopulations. Suppose each of the subpopulations is regarded as a random-mating unit in HWE for the flower-color alleles. The average heterozygosity among the subpopulations can be denoted as $H_S$, where the subscript indicates subpopulation. The calculations are shown in the third column in Table 4.1; the heterozygosity in each subpopulation is calculated as 2pq, where p and q are the estimated frequencies of the alleles for white versus blue flower color, respectively, in each subpopulation. The $H_S$ tabulated at the bottom is the average of all the subpopulation heterozygosities (counting the value 0.000 a total of nine times because of the nine different subpopulations in which q = 0.000).

TABLE 4.1 HIERARCHICAL STRUCTURE OF LINANTHUS PARRYAE

| Region | Subpopulation allele frequency | Subpopulation heterozygosity | Regional average allele frequency | Regional heterozygosity | Total average allele frequency | Total heterozygosity |
|---|---|---|---|---|---|---|
| W | 0.573 | 0.4893 | | | | |
| W | 0.717 | 0.4058 | | | | |
| W | 0.504 | 0.5000 | | | | |
| W | 0.657 | 0.4507 | | | | |
| W | 0.302 | 0.4216 | | | | |
| W | 0.339 | 0.4482 | 0.5153 | 0.4995 | | |
| C | 0.000 (nine subpopulations) | 0.0000 | | | | |
| C | 0.032 | 0.0620 | | | | |
| C | 0.007 | 0.0139 | | | | |
| C | 0.008 | 0.0159 | | | | |
| C | 0.005 | 0.0100 | | | | |
| C | 0.009 | 0.0178 | | | | |
| C | 0.005 | 0.0100 | | | | |
| C | 0.010 | 0.0198 | | | | |
| C | 0.068 | 0.1268 | | | | |
| C | 0.002 | 0.0040 | | | | |
| C | 0.004 | 0.0080 | | | | |
| C | 0.126 | 0.2203 | 0.0138 | 0.0272 | | |
| E | 0.106 | 0.1895 | | | | |
| E | 0.224 | 0.3476 | | | | |
| E | 0.411 | 0.4842 | | | | |
| E | 0.014 | 0.0276 | 0.1888 | 0.3062 | 0.1374 | 0.2371 |
| Average heterozygosity | | $H_S$ = 0.1424 | | $H_R$ = 0.1589 | | $H_T$ = 0.2371 |

Source: Data from Wright 1943a.

A second hierarchical level of population substructure is that of region: west (W), central (C), or east (E). To calculate the heterozygosity expected from HWE in each region, we first estimate the average allele frequency in the region by taking the mean allele frequency across all subpopulations in the region. For example, the average allele frequency $\bar{q}$ in region E is (0.106 + 0.224 + 0.411 + 0.014)/4 = 0.1888. In each region, the heterozygosity expected from HWE is calculated as 2pq, where p and q are the average allele frequencies in the region. In region E, therefore, the regional heterozygosity equals 2 × (1 - 0.1888) × 0.1888 = 0.3062. The average heterozygosity within regions at the bottom of column 5 is denoted $H_R$; it is the weighted average of the regional heterozygosities, where each regional heterozygosity is weighted by the number of subpopulations in the region. In this example, $H_R$ = (6 × 0.4995 + 20 × 0.0272 + 4 × 0.3062)/30 = 0.1589.

Yet another hierarchical level of population substructure in Figure 4.2 is the total population, the aggregate population obtained by conceptually uniting all subpopulations to form a single random-mating unit. The average allele frequency is the mean allele frequency across all subpopulations, and $\bar{q}$ = 0.1374. Then $H_T$ is calculated as 2pq = 2 × 0.8626 × 0.1374 = 0.2371. To sum up:

• $H_S$ is the average HWE heterozygosity among organisms within random-mating subpopulations.
• $H_R$ is the average HWE heterozygosity among organisms within regions.
• $H_T$ is the average HWE heterozygosity among organisms within the total area.
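The three heterozygosities summarized above can be recomputed directly from the subpopulation allele frequencies in Table 4.1. The sketch below is illustrative; the dictionary layout and variable names are assumptions made here, and the calculation follows the averaging scheme described in the text (unweighted means of allele frequencies, with regions weighted by their number of subpopulations).

```python
from statistics import mean


def hwe_heterozygosity(q):
    """Heterozygosity 2pq expected under HWE for recessive-allele frequency q."""
    return 2 * q * (1 - q)


# Frequencies of the blue-flower allele by region, as listed in Table 4.1.
regions = {
    "W": [0.573, 0.717, 0.504, 0.657, 0.302, 0.339],
    "C": [0.000] * 9 + [0.032, 0.007, 0.008, 0.005, 0.009, 0.005,
                        0.010, 0.068, 0.002, 0.004, 0.126],
    "E": [0.106, 0.224, 0.411, 0.014],
}
all_q = [q for qs in regions.values() for q in qs]

# H_S: average of the subpopulation heterozygosities.
h_s = mean([hwe_heterozygosity(q) for q in all_q])

# H_R: regional heterozygosities weighted by the number of subpopulations.
h_r = sum(len(qs) * hwe_heterozygosity(mean(qs))
          for qs in regions.values()) / len(all_q)

# H_T: heterozygosity at the mean allele frequency of the total population.
h_t = hwe_heterozygosity(mean(all_q))
```

Run on these 30 frequencies, the sketch reproduces the tabulated values to four decimal places: $H_S$ of about 0.1424, $H_R$ of about 0.1589, and $H_T$ of about 0.2371.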
The concepts of hierarchical population structure and the various levels of heterozygosity were originally developed by Sewall Wright (1889-1988) to quantify genetic differences among subgroups at the various levels; he called his theory isolation by distance (Wright 1943a, 1943b). The motivation for developing such a method was summarized in the following passage, in which the term panmixia is a synonym for random mating.

    Study of statistical differences among local populations is an important line of attack on the evolutionary problem. While such differences can only rarely represent first steps toward speciation in the sense of the splitting of the species, they are important for the evolution of the species as a whole. They provide a possible basis for intergroup selection of genetic systems, a process that provides a more effective mechanism for adaptive advance of the species as a whole than does the mass selection which is all that can occur under panmixia.

Furthermore, the reduction in heterozygosity resulting from population substructure is intimately related to the reduction in heterozygosity caused by inbreeding (mating between relatives), as we shall see later in this chapter. Indeed, the relation of population substructure to inbreeding can be understood by interpreting each subpopulation as a sort of "extended family" or set of interconnected pedigrees. Organisms in the same subpopulation will often share one or more recent or remote common ancestors, and so a mating between organisms in the same subpopulation will often be a mating between relatives. The larger the subpopulation, and the more recently it has been isolated, the smaller this inbreeding effect; nevertheless, the analogy to inbreeding is valid.

Wright's F Statistics

To quantify the inbreeding effect of population substructure, Wright (1921) defined what has come to be called the fixation index.
This index equals the reduction in heterozygosity expected with random mating at any one level of a population hierarchy relative to another, more inclusive level of the hierarchy. The fixation index is a useful index of genetic differentiation because it allows an objective comparison of the overall effect of population substructure among different organisms without getting into details of allele frequencies, observed levels of heterozygosity, and so forth. The genetic symbol for a fixation index is $F$ embellished with subscripts denoting the levels of the hierarchy being compared. For example, $F_{SR}$ is the fixation index of the subpopulations relative to the regional aggregates:

$$F_{SR} = \frac{H_R - H_S}{H_R} \qquad (4.1)$$

In words, Equation 4.1 defines $F_{SR}$ as the decrease of heterozygosity among subpopulations within regions ($H_R - H_S$), relative to the heterozygosity among regions ($H_R$). For the Linanthus example in Table 4.1, $F_{SR} = (0.1589 - 0.1424)/0.1589 = 0.1036$.

At the next level of the hierarchy, we may define the fixation index $F_{RT}$ as the proportionate reduction in heterozygosity of the regional aggregates relative to the total combined population:

$$F_{RT} = \frac{H_T - H_R}{H_T} \qquad (4.2)$$

The data in Table 4.1 show that $F_{RT} = (0.2371 - 0.1589)/0.2371 = 0.3299$. Comparison of this value with $F_{SR}$ above already makes it clear that there is substantially more variation among regions (as measured by $F_{RT}$) than there is among subpopulations within regions (as measured by $F_{SR}$). The comparison of the fixation indices at the two levels gives quantitative expression to the regional differences apparent in Figure 4.2.

The fixation index $F_{ST}$ compares the least inclusive to the most inclusive levels of the population hierarchy and measures all effects of population substructure combined:

$$F_{ST} = \frac{H_T - H_S}{H_T} \qquad (4.3)$$

From Table 4.1, $F_{ST} = (0.2371 - 0.1424)/0.2371 = 0.3993$. The overall reduction in average heterozygosity is therefore close to 40% of the total heterozygosity, a very substantial effect.
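Equations 4.1 through 4.3 share one form: the drop in heterozygosity from the more inclusive level to the less inclusive one, divided by the heterozygosity at the more inclusive level. A minimal Python sketch under that reading follows; the function name `fixation_index` is an illustrative choice, and the inputs are the rounded heterozygosities from Table 4.1.

```python
def fixation_index(h_inner, h_outer):
    """Fixation index of a less inclusive level relative to a more inclusive one.

    h_inner: average HWE heterozygosity at the less inclusive level (e.g. H_S)
    h_outer: HWE heterozygosity at the more inclusive level (e.g. H_T)
    """
    if h_outer <= 0:
        raise ValueError("the more inclusive heterozygosity must be positive")
    return (h_outer - h_inner) / h_outer


# Rounded heterozygosities from Table 4.1.
H_S, H_R, H_T = 0.1424, 0.1589, 0.2371

f_sr = fixation_index(H_S, H_R)  # Equation 4.1
f_rt = fixation_index(H_R, H_T)  # Equation 4.2
f_st = fixation_index(H_S, H_T)  # Equation 4.3

print(f"F_SR = {f_sr:.4f}, F_RT = {f_rt:.4f}, F_ST = {f_st:.4f}")
```

Because the inputs are already rounded to four decimal places, the printed values can differ from the figures quoted in the text in the last digit.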
The hierarchical F statistics defined in Equations 4.1 through 4.3 are all types of fixation indices, but they differ in the reference populations: $F_{SR}$ is concerned with subpopulations (S) relative to the regional aggregates (R), $F_{RT}$ is concerned with the regional groupings relative to the total population (T), and $F_{ST}$ is concerned with the subpopulations relative to the total population. The index $F_{ST}$ is the most inclusive measure of population substructure. The mathematical relation between the three types of F statistics is demonstrated in the following problem.

PROBLEM 4.1 Show that $F_{SR}$, $F_{RT}$, and $F_{ST}$ are related by the equation

$$(1 - F_{SR}) \times (1 - F_{RT}) = 1 - F_{ST}$$

ANSWER From Equation 4.1, $F_{SR} = 1 - (H_S/H_R)$, or $1 - F_{SR} = H_S/H_R$. Equation 4.2 implies that $F_{RT} = 1 - (H_R/H_T)$, or $1 - F_{RT} = H_R/H_T$. Finally, Equation 4.3 implies that $F_{ST} = 1 - (H_S/H_T)$, or $1 - F_{ST} = H_S/H_T$. Now multiply the expressions for $1 - F_{SR}$ and $1 - F_{RT}$ together to obtain $(1 - F_{SR}) \times (1 - F_{RT}) = (H_S/H_R) \times (H_R/H_T) = H_S/H_T = 1 - F_{ST}$.

For examining the overall level of genetic divergence among subpopulations, $F_{ST}$ is the informative statistic. Although $F_{ST}$ has a theoretical minimum of 0 (indicating no genetic divergence) and a theoretical maximum of 1 (indicating fixation for alternative alleles in different subpopulations), the observed maximum is usually much less than 1. Wright (1978) has suggested the following qualitative guidelines for the interpretation of $F_{ST}$:

• The range 0 to 0.05 may be considered as indicating little genetic differentiation.
• The range 0.05 to 0.15 indicates moderate genetic differentiation.
• The range 0.15 to 0.25 indicates great genetic differentiation.
• Values of $F_{ST}$ above 0.25 indicate very great genetic differentiation.

On the other hand, Wright also notes that, among subpopulations, "differentiation is by no means negligible if $F_{ST}$ is as small as 0.05 or even less."
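As a quick check on both the identity in Problem 4.1 and the qualitative guidelines, the Linanthus values quoted earlier can be run through a short Python sketch. The function name `classify_fst` is illustrative. The guidelines do not say which category the boundary values 0.05 and 0.15 belong to, so assigning them to the higher category is an assumption made here; 0.25 is placed in "great" because only values above 0.25 are described as "very great".

```python
import math


def classify_fst(fst):
    """Map an F_ST value onto Wright's (1978) qualitative categories.

    Boundary handling (0.05 and 0.15 go to the higher category) is an
    assumption; the guidelines themselves leave it unspecified.
    """
    if not 0.0 <= fst <= 1.0:
        raise ValueError("F_ST is expected to lie between 0 and 1")
    if fst < 0.05:
        return "little"
    if fst < 0.15:
        return "moderate"
    if fst <= 0.25:
        return "great"
    return "very great"


# Linanthus values as quoted in the text.
f_sr, f_rt, f_st = 0.1036, 0.3299, 0.3993

# Identity from Problem 4.1, checked up to rounding of the quoted values.
identity_holds = math.isclose((1 - f_sr) * (1 - f_rt), 1 - f_st, abs_tol=1e-3)

print(identity_holds, classify_fst(f_st))
```

The tolerance in `math.isclose` only absorbs the rounding of the quoted four-digit values; with unrounded heterozygosities the identity holds exactly.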
PROBLEM 4.2 Some subpopulations of Drosophila melanogaster show an altitudinal gradient in the allozymes of alcohol dehydrogenase in which the frequency of the Adh-F allele increases with altitude. The data in the accompanying table are estimates of the allele frequency of Adh-F in seven samples of adult flies captured either in the mountains, in the foothills, or on the plains of the Caucasus Mountains of the former Soviet Union. Each allele frequency is based on electrophoresis of approximately 300 adult flies (Grossman et al. 1970). Calculate the F statistics $F_{SE}$ (subpopulations within elevations), $F_{ET}$ (elevations within the total), and $F_{ST}$ (subpopulations relative to the total). What do the magnitudes of the F statistics suggest regarding genetic differentiation among subpopulations in the frequency of Adh-F with respect to altitude?

Elevation    Allele frequency
Mountain     0.321
Mountain     0.226
Foothill     0.131
Foothill     0.109
Plain        0.082
Plain        0.088
Plain        0.035

ANSWER Let $p$ represent the allele frequency of Adh-F. For each subpopulation, the HWE heterozygosity equals $2p(1 - p)$, which for the seven samples are 0.4359 and 0.3498 (mountain), 0.2277 and 0.1942 (foothill), and 0.1506, 0.1605, and 0.0676 (plain). The average of these values is $H_S$, which equals 0.2266. At each of the elevations, the average allele frequency is the mean across the subpopulations sampled at that elevation. For mountain, foothill, and plain, these means equal 0.274, 0.120, and 0.068, respectively, yielding the elevation HWE heterozygosities 0.3974, 0.2112, and 0.1273, respectively. (Your results may differ slightly according to the number of significant digits you carry along.) The average of the elevation heterozygosities equals the mean elevation heterozygosity ($H_E$), and it is the weighted average $(2 \times 0.3974 + 2 \times 0.2112 + 3 \times 0.1273)/7 = 0.2285$.
Finally, the allele frequency for the total heterozygosity is equal to the mean allele frequency across subpopulations, which is 0.142, yielding a total HWE heterozygosity ($H_T$) of 0.2433. The F statistics are $F_{SE} = (H_E - H_S)/H_E = 0.0081$, $F_{ET} = (H_T - H_E)/H_T = 0.0609$, and $F_{ST} = (H_T - H_S)/H_T = 0.0684$. [As a check, note that $(1 - F_{SE}) \times (1 - F_{ET}) = 1 - F_{ST}$.] Judging from the magnitudes of the F statistics, it is clear that most of the differentiation among subpopulations is correlated with altitude; there is very little genetic differentiation among subpopulations at each elevation.

The method of estimating the F statistics by replacing the parameters in Equations 4.1 through 4.3 with their observed or estimated values is not necessarily the best, particularly with small samples. Ideally, estimates of the F statistics should correct for the effects of sampling a limited number of subpopulations, as well as for the effects of sampling a limited number of organisms in each subpopulation. Methods for making these corrections have been suggested but are quite complex and raise additional issues. For an excellent discussion, see Weir and Cockerham (1984). Important issues are also addressed in Wright (1978, pp. 86-89), Curie-Cohen (1982), Nei and Chesser (1983), and Nei (1986). We will use the uncorrected estimation procedure, which is adequate for purposes of illustration.

Genetic Divergence among Subpopulations

The fixation index $F_{ST}$ defined in Equation 4.3 serves as a convenient and widely used measure of genetic differences among subpopulations. The identification of the causes underlying a particular value of $F_{ST}$ observed in a natural population is often difficult. Allele frequencies among subpopulations can become different because of random processes (random genetic drift) as well as by natural selection, with complications from migration among the subpopulations.
Difficulties in the assignment of cause do not, however, invalidate the usefulness of $F_{ST}$ as an index of genetic differentiation. The levels of genetic divergence among human subpopulations and among subpopulations of several other species are presented in Table 4.2. The values of $F_{ST}$ imply that genetic divergence between human subpopulations is quite small. Of the total genetic variation found in three major races (Caucasoid, Negroid, and Mongoloid), only 7% (0.07) is ascribable to genetic differences among races. About 93% of the total genetic variation is found within races. Similarly, of the total genetic variation found in the native Yanomama Indians of Venezuela and Brazil, only 7.7% (0.077) is due to differences in allele frequency among villages. This result implies that 92.3% of the total genetic variation is found within any single village. Values of $F_{ST}$ for other organisms are quite variable, presumably because $F_{ST}$ is influenced by the size of the subpopulations (which is a major determinant of the magnitude of random changes in allele frequency), by the amount and pattern of migration between subpopulations, and by other factors, including natural selection.

TABLE 4.2 TOTAL HETEROZYGOSITY (H_T), AVERAGE HETEROZYGOSITY AMONG SUBPOPULATIONS (H_S), AND FIXATION INDEX (F_ST) FOR VARIOUS ORGANISMS

                                        Number of     Number
Organism                                populations   of loci   H_T     H_S     F_ST
Human (major races)                     3             35        0.130   0.121   0.069
Human, Yanomama Indian villages         37            15        0.039   0.036   0.077
House mouse (Mus musculus)              4             40        0.097   0.086   0.113
Jumping rodent (Dipodomys ordii)        9             18        0.037   0.012   0.676
Drosophila equinoxialis                 5             27        0.201   0.179   0.109
Horseshoe crab (Limulus)                4             25        0.066   0.061   0.076
Lycopod plant (Lycopodium lucidulum)    4             13        0.071   0.051   0.282

Source: Protein electrophoretic data from Nei 1975.

Table 4.2 provokes a brief discussion of the sensitive term race because the term is prone to misunderstanding or misuse.
In population genetics, a race is a group of organisms in a species that are genetically more similar to each other than they are to the members of other such groups. Populations that have undergone some degree of genetic divergence as measured by, for example, $F_{ST}$, therefore qualify as races. Using this definition, the human population contains many races. Each Yanomama village represents, in a certain sense, a separate "race," and the Yanomama as a whole also form a distinct "race." Such fine distinctions are rarely useful, however. It is usually more convenient to group populations into larger units that still qualify as races in the definition given. These larger units often coincide with races based on physical characteristics such as skin color, hair color, hair texture, facial features, and body conformation. Contemporary anthropologists tend to avoid "race" as a descriptive term for human groups because cultural and linguistic differences, which are also important, are often discordant with genetic differences and sometimes discordant with each other.

Here it must be pointed out that the data in Table 4.2, which indicate much more genetic variation within than among human races, may be misleading. The conclusion is based primarily on genes determining allozymes, and it certainly is not true for genes influencing skin color, hair color, hair texture, and other traits that most people think of in connection with the word "race." However, skin color and other prominent racial characteristics are used to delineate races precisely because racial differences for these traits are rather large, so the genes involved cannot be representative of the entire genome. On the other hand, allozyme loci may not be very representative of the genome either. See Nei and Roychoudhury (1982) for a review of the genetic relationship and evolution of human races.
ISOLATE BREAKING: THE WAHLUND PRINCIPLE

The flip side of the coin of heterozygosity is homozygosity because a diploid organism that is not heterozygous must be homozygous. Mathematically, homozygosity = 1 - heterozygosity. Therefore, a corollary of the deficit in average heterozygosity, relative to HWE, that results from population substructure is that there is an equal excess in average homozygosity. If the population substructure is eliminated and the former subpopulations undergo random mating, the average homozygosity decreases, and the average heterozygosity increases by an equal amount. The phenomenon that the average homozygosity decreases when subpopulations join together is called isolate breaking or the Wahlund principle, after the Swedish statistician and human geneticist Sten Gösta William Wahlund (1901-1976), who first described the effect (Wahlund 1928).

The subpopulations of hypothetical mice in Figure 4.1 afford an illustration of the Wahlund principle. As long as the cats keep the subpopulations separate, the homozygosity equals 1 because the west subpopulation is genotypically AA and the east subpopulation is genotypically aa. If the cats were to disappear and the subpopulations of mice came together and practiced random mating, the genotype frequencies would be 1/4 AA, 1/2 Aa, and 1/4 aa. The homozygosity in the fused population is 1/4 + 1/4 = 1/2, which is a substantial decrease relative to the average in the subpopulations prior to fusion and random mating. Not only is the total homozygosity reduced by population fusion, so is the average frequency of each homozygous genotype. Consider aa, for example. Prior to fusion, the average frequency of aa across both subpopulations equals 1/2; after fusion and random mating, the frequency of aa equals 1/4.

In human population genetics, the Wahlund principle is usually cited for its implication that fusion of subpopulations results in a decrease in the average frequency of children born with a genetic disease resulting from homozygosity for a rare recessive allele, particularly an allele with a relatively high frequency in one of the subpopulations. Examples of harmful recessive alleles at high frequency in some human subpopulations include, in Caucasians, the alleles for $\alpha_1$-antitrypsin deficiency ($q = 0.024$) and cystic fibrosis ($q = 0.022$); in blacks, sickle-cell anemia ($q = 0.05$ in American blacks up to $q = 0.1$ in some African populations); in the Hopi and some other Southwest American Indian tribes, albinism ($q = 0.07$); and, in Ashkenazi Jews, Tay-Sachs disease ($q = 0.013$).

The Wahlund principle for a recessive allele in two subpopulations is illustrated in Figure 4.3A. The west subpopulation has allele frequency $q_1$ and genotype frequency $q_1^2$; the east subpopulation has allele frequency $q_2$ and genotype frequency $q_2^2$. The average frequency of the homozygous recessive across both subpopulations equals $(q_1^2 + q_2^2)/2$. The result of fusion of the subpopulations is shown in part B. Assuming that the subpopulations are equal in size, the allele frequency in the combined population is $\bar{q} = (q_1 + q_2)/2$, and the genotype frequency with HWE equals $\bar{q}^2$. Therefore, were the subpopulations in part A to fuse and come into HWE, the average frequency of homozygous recessives would be reduced by an amount given by:

$$R_{separate} - R_{fused} = \frac{q_1^2 + q_2^2}{2} - \bar{q}^2 = \frac{(q_1 - \bar{q})^2 + (q_2 - \bar{q})^2}{2} = \sigma_q^2 \qquad (4.4)$$

Figure 4.3 Illustration of the Wahlund principle. (A) Separate subpopulations: average $R_{separate} = (q_1^2 + q_2^2)/2$. (B) Fused subpopulations: $R_{fused} = [(q_1 + q_2)/2]^2$. The frequency of homozygous recessives after population fusion and random mating is less than the average frequency before fusion. The difference in frequency of the homozygous recessives equals the variance in allele frequency among the subpopulations.

In Equation 4.4, we leave it as an exercise to verify that the first and second expressions in $q_1$ and $q_2$ are equal. The symbol $\sigma_q^2$ is the variance in allele frequency among the original subpopulations. Because the variance is always nonnegative, isolate breaking always decreases the average frequency of homozygous recessives unless the allele frequencies are equal to begin with. Furthermore, the result in Equation 4.4 is true for any number of subpopulations of equal or unequal size; in words:

    Fusion of subpopulations with random mating and HWE decreases the average frequency of homozygous recessives by an amount equal to the variance in allele frequency among the original subpopulations.

To illustrate the effect of isolate breaking, imagine a subpopulation of gray squirrels that has a high frequency of albinism equal to 16%. (Albinism is an inherited absence of pigment resulting from a homozygous recessive gene.) With HWE, the allele frequency of the albino mutation in this subpopulation is $\sqrt{0.16} = 0.4$. In a nearby forest there is another subpopulation of equal size in which the albino mutation is absent, so that the allele frequency in this subpopulation is 0. Overall, the average frequency of albinos in the two populations is $(0.16 + 0)/2 = 8\%$. If the two subpopulations fused with random mating and HWE, the allele frequency of the albino mutation in the fused population would be $(0.4 + 0)/2 = 0.2$, and the frequency of the homozygous recessive would equal $0.2^2 = 4\%$. The frequency of albinos in the fused population is substantially smaller than the average frequency in the original subpopulations.

PROBLEM 4.3 Tay-Sachs disease is an autosomal-recessive degenerative disorder of the brain that usually leads to death in infancy or early childhood.
Among Ashkenazi Jews, the incidence of the condition is about 1 in 6000 births but, in other groups, the incidence is about 1 in 500,000 births (Myrianthopoulos and Aronson 1966). What incidence of the disease would be expected among the offspring of matings of Ashkenazi Jews with members of other groups? If these offspring were to mate randomly among themselves, what incidence of the disease would be expected in future generations?

ANSWER The allele frequency of the Tay-Sachs mutation among Ashkenazi Jews is estimated as $q_1 = \sqrt{1/6{,}000} = 1.291 \times 10^{-2}$; in other groups, $q_2 = \sqrt{1/500{,}000} = 1.414 \times 10^{-3}$. In matings between members of the two groups, the expected frequency of homozygous recessives is $q_1 q_2 = 1.826 \times 10^{-5}$, or about 1 in 55,000 births. There is actually a greater reduction in the first generation than in subsequent generations because each mating in the first generation combines a high-risk gamete with a low-risk gamete. The allele frequency in the first-generation offspring is $(q_1 + q_2)/2 = 7.162 \times 10^{-3}$ and, with HWE in subsequent generations, the homozygous recessive frequency stabilizes at $(7.162 \times 10^{-3})^2 = 5.130 \times 10^{-5}$, or about 1 in 19,000 births. (The fact that homozygous recessives do not reproduce has been ignored because the effect is negligible.)

Wahlund's Principle and the Fixation Index

Equation 4.4 applies equally well to AA homozygotes as to aa homozygotes. Therefore, letting $P$ represent the frequency of homozygous AA genotypes, we can write

$$P_{separate} - P_{fused} = \sigma_p^2 \qquad (4.5)$$

When there are only two alleles, the total reduction in homozygosity must be the summation of Equations 4.4 and 4.5, which equals $\sigma_q^2 + \sigma_p^2$. Because there are only two alleles, it is also true that $\sigma_q^2 = \sigma_p^2$, which we will write as $\sigma^2$.
Hence, the total reduction in homozygosity from the Wahlund effect upon population fusion and HWE can be expressed as follows:

$$\text{Reduction in total homozygosity} = 2\sigma^2$$

On the other hand, the reduction in total homozygosity with population fusion must also equal the increase in heterozygosity (the term $H_T - H_S$ in Equation 4.3), which is the numerator of $F_{ST}$. Hence, $F_{ST} = (H_T - H_S)/H_T = 2\sigma^2/H_T$. However, $H_T$ is the heterozygosity with HWE using the average allele frequencies, $\bar{p}$ and $\bar{q}$, across subpopulations, so that $H_T = 2\bar{p}\bar{q}$. Therefore, the connection between the fixation index $F_{ST}$ and the variance in allele frequency is

$$F_{ST} = \frac{\sigma^2}{\bar{p}\bar{q}} \qquad (4.6)$$

Consequently, the F statistics at the various levels of a hierarchical population are related to the variances in allele frequencies among the subpopulations grouped together at the various levels. Equation 4.6 affords a convenient method of estimating $F_{ST}$ from allele-frequency data. For example, among the subpopulations of Linanthus in Figure 4.2, the variance in allele frequency is 0.0473. Earlier we calculated the average allele frequencies as $\bar{p} = 0.8626$ and $\bar{q} = 0.1374$. Hence, $\sigma^2/(\bar{p} \times \bar{q}) = 0.3993$, which confirms the previous calculation that $F_{ST} = 0.3993$. (The values as stated may differ slightly from yours because they were calculated with more than four significant digits.)

PROBLEM 4.4 The data in the accompanying table are the allele frequencies of several genes in three human subpopulations: (A) blacks from West Africa; (B) blacks from Claxton, Georgia; and (C) whites from Claxton, Georgia (Adams and Ward 1973). Each gene has two predominant alleles and may, for purposes of this problem, be considered to have only two alleles.
The genes control the MN blood group (alleles $M$ and $N$), the Ss blood group (alleles $S$ and $s$), the Duffy blood group (alleles $Fy^a$ and $Fy^b$), the Kidd blood group (alleles $Jk^a$ and $Jk^b$), the Kell blood group (alleles $Js^a$ and $Js^b$), the enzyme glucose-6-phosphate dehydrogenase (alleles $G6PD^-$ and $G6PD^+$), and $\beta$-hemoglobin (alleles $\beta^S$ and $\beta^+$). For each gene, use Equation 4.6 to estimate $F_{ST}$ for the comparison A versus B and for the comparison A versus C. Classify each $F_{ST}$ as indicating little, moderate, great, or very great genetic differentiation according to Wright's qualitative guidelines. Note: In comparing two subpopulations with two alleles in each, the variance in allele frequency is $\sigma^2 = (p_1 - p_2)^2/4$.

             Subpopulation
Gene         Blacks (West Africa)   Blacks (Georgia)   Whites (Georgia)
M            0.474                  0.484              0.507
S            0.172                  0.157              0.279
Fy^a         0.000                  0.045              0.422
Jk^a         0.693                  0.743              0.536
Js^a         0.117                  0.123              0.002
G6PD^-       0.176                  0.118              0.000
beta^S       0.090                  0.043              0.000

ANSWER The estimates and their qualitative interpretations are as shown in the table. It is clear that the degree of genetic divergence between West African blacks and Georgia blacks, as assessed by the average $F_{ST}$ value, is relatively small. However, some of the genes show substantial genetic divergence between blacks and whites. Note, however, that the fixation index can differ substantially from one gene to another. This compilation of genes includes gradations of genetic divergence ranging from little to very great.

Gene         A versus B          A versus C
M            0.0001 (little)     0.0011 (little)
S            0.0004 (little)     0.0164 (little)
Fy^a         0.0230 (little)     0.2676 (very great)
Jk^a         0.0031 (little)     0.0260 (little)
Js^a         0.0001 (little)     0.0591 (moderate)
G6PD^-       0.0067 (little)     0.0965 (moderate)
beta^S       0.0089 (little)     0.0471 (little)
Average      0.0060 (little)     0.0734 (moderate)

Genotype Frequencies in Subdivided Populations

In many organisms in which the population structure is hierarchical, it is useful to be able to calculate directly the average genotype frequencies across all subpopulations.
Equations 4.4 through 4.6 make it possible to deduce the average genotype frequencies. Consider first Equation 4.5, which pertains to the genotype frequency of AA. The quantity called $P_{separate}$ is what we wish to calculate: it is the average frequency of AA across subpopulations. The quantity $P_{fused}$ equals $\bar{p}^2$, the genotype frequency of AA with population fusion and HWE. The value of $\sigma^2$ is also known from Equation 4.6; it equals $F_{ST} \times \bar{p} \times \bar{q}$. Putting all this together, the average genotype frequency of AA across subpopulations must equal $\bar{p}^2 + F_{ST}\bar{p}\bar{q}$. Likewise, interpreting Equation 4.4 in the same manner as Equation 4.5 yields the average genotype frequency of aa across subpopulations as $\bar{q}^2 + F_{ST}\bar{p}\bar{q}$. Because every genotype that is not homozygous must be heterozygous, the average genotype frequency of heterozygotes across subpopulations is given by $1 - (\bar{p}^2 + F_{ST}\bar{p}\bar{q}) - (\bar{q}^2 + F_{ST}\bar{p}\bar{q})$. Note that $1 - \bar{p}^2 - \bar{q}^2 = 2\bar{p}\bar{q}$, and so the average frequency of heterozygotes simplifies to $2\bar{p}\bar{q} - 2\bar{p}\bar{q}F_{ST}$. The genotype frequencies in a subdivided population are important enough to be displayed.

$$\begin{aligned} AA&: \ \bar{p}^2 + \bar{p}\bar{q}F_{ST} \\ Aa&: \ 2\bar{p}\bar{q} - 2\bar{p}\bar{q}F_{ST} \\ aa&: \ \bar{q}^2 + \bar{p}\bar{q}F_{ST} \end{aligned} \qquad (4.7)$$

These genotype frequencies are the average genotype frequencies across all subpopulations. They do not obey the Hardy-Weinberg principle because there is an excess of homozygotes and a deficiency of heterozygotes relative to HWE. The result is somewhat paradoxical because, within any particular subpopulation, the genotype frequencies do obey the Hardy-Weinberg principle with whatever allele frequencies are found in that subpopulation. The reason for the validity of HWE within each subpopulation is the assumption of random mating within each subpopulation.
The reason for the departure from HWE in the population as a whole is that the subpopulations differ in allele frequency. Because the allele frequencies differ, random mating within each subpopulation is not equivalent to random mating among all the organisms in the entire population.

From the expressions in Equation 4.7, it is clear that the value of $F_{ST}$ determines the degree of departure from HWE. If $F_{ST} = 0$, the second term in each expression vanishes, and the genotype frequencies reduce to the HWE; on the other hand, $F_{ST} = 0$ means that there is no variation in allele frequency among the subpopulations for the gene in question. Because $F_{ST}$ may vary from one gene to the next, other genes in the same subpopulations may have nonzero values of $F_{ST}$. The extreme case is $F_{ST} = 1$, which happens when two subpopulations are fixed for alternative alleles. In this case, the average allele frequencies are 1/2 for each allele and the average genotype frequencies of AA, Aa, and aa across subpopulations are 1/2, 0, and 1/2, respectively. This case is illustrated in Figure 4.1.

POPULATION GENETICS IN DNA TYPING

The term DNA typing means the application of molecular genetics to highly polymorphic genetic markers for the purpose of matching DNA samples from unknown people with those of known suspects. Applications include paternity testing, in which DNA from a child is matched against that of an accused father, and criminal investigation, in which a crime-scene sample of DNA from blood, semen, or other sources is matched against that of one or more suspects. DNA typing undoubtedly ranks with the use of fingerprints as a major innovation in personal identification.

In theory, DNA typing is not as powerful as ordinary fingerprinting. Fingerprints result from the pattern of raised skin ridges that carry sweat glands. The ridge pattern on each finger may form an arch, loop, whorl, or other design.
The ridges vary in pattern from one person to the next so greatly that each person has unique fingerprints suitable for personal identification. When the fingers are formed in the embryo, the fingertips develop as fluid-filled pads. The fluid is later resorbed and the expanded skin collapses, forming the ridges. There is a strong random component to the manner in which the skin collapses, and so the details of the fingerprint pattern differ in each finger and in each person. Even identical twins have different fingerprints. However, certain general features of the fingerprints are strongly inherited; one example is the total number of ridges on all the fingers, without regard to pattern. The Dionne quintuplets (five Canadian girls born in 1934, all formed from the splitting of a single fertilized egg) had total ridge counts ranging between 99 and 102; by comparison, their older siblings had total ridge counts of 69, 78, and 139.

It is the random component in fingerprint ridge pattern that makes fingerprints so powerful for personal identification. DNA types are inherited and so are not necessarily unique in each person. Even for a highly polymorphic marker in which both parents are heterozygous (for example, in the mating $A_iA_j \times A_kA_l$), any particular genotype in an offspring has a 1/4 chance of being matched in a sibling owing to Mendelian segregation. Thus, strong evidence that an unknown DNA sample comes from a particular suspect can come only from the matching of a combination of genotypes across a number of polymorphic loci. The strength of the evidence increases with the number of loci that are examined and the number of alleles present in the population. The greater the number of loci, and the more highly polymorphic the loci, the stronger the evidence linking the suspect to the unknown sample.
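The 1/4 sibling-match figure can be checked by enumerating the gametes of the two parents. The Python sketch below is illustrative only: the function names are made up for this example, and the multi-locus line additionally assumes that the loci assort independently and that both parents are heterozygous for four distinct alleles at every locus, which the text does not state for siblings.

```python
from fractions import Fraction
from itertools import product


def offspring_distribution(mother, father):
    """Genotype -> probability for one child of the mating mother x father.

    Each parent is a pair of allele labels; each of the four gamete
    combinations has probability 1/4 under Mendelian segregation.
    """
    dist = {}
    for a, b in product(mother, father):
        genotype = tuple(sorted((a, b)))  # unordered genotype
        dist[genotype] = dist.get(genotype, Fraction(0)) + Fraction(1, 4)
    return dist


def sibling_match_probability(mother, father):
    """Chance that two full siblings share the same genotype at one locus."""
    dist = offspring_distribution(mother, father)
    return sum(p * p for p in dist.values())


# Mating A_i A_j x A_k A_l with four distinct alleles.
one_locus = sibling_match_probability(("Ai", "Aj"), ("Ak", "Al"))
print(one_locus)

# Five such loci, ASSUMING independent assortment (illustrative only).
print(one_locus ** 5)
```

Under these assumptions the chance that siblings match drops by a factor of four with each added locus, which is one way to see why the strength of the evidence grows with the number of loci examined.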
Although matching DNA types may provide strong evidence that a suspect is the source of an unknown sample, a DNA mismatch is usually conclusive. When the DNA of a suspect contains alleles that are clearly not present in the unknown sample, then the sample must have originated from a different person.

Polymorphisms Based on a Variable Number of Tandem Repeats (VNTR)

The type of polymorphism usually used in DNA typing in the United States is illustrated in Figure 4.4. Each allele of a locus is defined by the size of a restriction fragment that hybridizes with a locus-specific probe in a Southern blot (Chapter 2). The restriction fragments differ in size according to the number of copies they contain of a short sequence of nucleotides repeated in tandem. When there are more copies of the repeating unit, the restriction fragment is of greater size. A polymorphic gene of this type is called a VNTR polymorphism, which means that the restriction fragments contain a variable number of tandem repeats. VNTRs are employed in DNA typing because the variable number of repeating units makes many alleles possible. Although many alleles may be present in the population as a whole, any one person can have no more than two alleles of each VNTR locus.

Figure 4.4 Allelic variation resulting from a variable number of units repeated in tandem in a nonessential region of a gene. The probe DNA detects a restriction fragment for each allele. The length of the fragment depends on the number of repeating units present. (From Hartl 1994.)

An example of a VNTR used in DNA typing is shown in Figure 4.5. The lanes in the gel labeled M contain multiple DNA fragments of known size to serve as molecular-weight markers. Each numbered lane contains DNA from a different person.
Two typical features of VNTRs are to be noted:

• Most people are heterozygous for two VNTR alleles with restriction fragments of different size. Heterozygosity is indicated by the presence of two distinct bands. In Figure 4.5, only the persons numbered 2 and 5 appear to be homozygous for a particular allele.

• The restriction fragments from different people cover a wide range of sizes. The variability in size indicates that the population as a whole contains many VNTR alleles.

Figure 4.5 also makes it clear why VNTR polymorphisms are useful in DNA typing: each of the 13 people has a different DNA type (pattern of bands) for this VNTR and therefore could be distinguished from any other person. On the other hand, the uniqueness of each DNA type in Figure 4.5 results in part from the small sample size. If more people were examined, then DNA types that matched by chance might well be found among unrelated people. For example, in one study of five VNTR loci, the chance of a match between unrelated people ranged from 1/20 to 1/200, depending on the locus (Herrin 1993). Although less common, chance matches for two VNTR loci can also be found among unrelated people. The same study found two-locus matches at frequencies of 1/2,500 to 1/50,000. Even chance matches for three VNTR loci are far from impossible. In one study of Italians from Milan, three-locus matches were found at a frequency of approximately 1/1,200 (Krane et al. 1992).

Figure 4.5 Genetic variation in a VNTR used in DNA typing. Each numbered lane contains DNA from a single person. After digestion of the DNA with a restriction enzyme, the fragments are separated by electrophoresis and hybridized with a radioactive probe DNA. The lanes labeled M contain molecular-weight markers; lane C is another type of internal control. (Courtesy of R. W. Allen.)
Because of the possibility of chance matches between VNTR types, applications of DNA typing are usually based on at least three loci and preferably more. Matches at 7 to 9 VNTR loci are virtually definitive of identity, barring technical errors in the DNA typing itself (such as mislabeling of blood samples) and except for identical twins.

DNA typing can be exclusionary as well as incriminating. For example, if the DNA type of a suspected rapist does not match the DNA type of semen taken from the victim, then the suspect could not be the perpetrator, unless there is some reason to suspect that the test itself was faulty. Figure 4.6, for instance, shows the DNA profiles of nine VNTR loci among three suspects and from evidence recovered in seven serial rape cases. The label M denotes molecular-weight markers (present in four lanes in each panel), S1-S3 denotes three suspects in the cases, and U1-U7 denotes DNA from semen samples recovered from the seven victims. Suspects S1 and S3 are excluded by the DNA typing, but S2 matches at all nine loci. Based on this and other evidence, a jury convicted suspect S2 of 81 criminal counts related to these and other cases. He was sentenced to 139 years in prison and will not become eligible for parole until the year 2087.

Match Probabilities with Hardy-Weinberg Equilibrium and Linkage Equilibrium

If a person is found whose DNA type matches that of a sample found at the scene of a crime, how is the significance of the match to be evaluated? The significance of the match depends on the likelihood of it happening by chance, and hence matches of rare DNA types are more telling than matches of common DNA types.
Initially, the method for estimating the frequency of a DNA type in the population was to use a cross-multiplication square like that in Figure 3.6, extended to multiple alleles, to calculate the expected frequency of the particular genotype for each VNTR locus; this calculation assumes Hardy-Weinberg equilibrium (HWE). The locus-by-locus frequencies were then multiplied together to obtain the expected frequency of the multilocus match; this calculation assumes linkage equilibrium. With HWE and linkage equilibrium, the expected frequency of a DNA type in the population as a whole is calculated as

$$\prod_{\text{homozygous loci}} p_i^2 \;\times \prod_{\text{heterozygous loci}} 2p_ip_j \qquad (4.8)$$

where capital $\Pi$ means chain multiplication. The first multiplication is across all loci presumed to be homozygous owing to the presence of a single band in the gel; for each locus, $p_i$ is the frequency of the allele that is homozygous. The second multiplication is across all heterozygous loci and, for each locus, the factor is two times the product of the frequencies of the alleles that are heterozygous. Because human subpopulations can differ in their allele frequencies, the calculation would be carried out using allele frequencies among Caucasians for white suspects, using those among blacks for black suspects, and using those among Hispanics for Hispanic suspects.

Effects of Population Substructure

The multiplication in Equation 4.8 makes a number of assumptions about human populations: (1) that the Hardy-Weinberg principle holds for each locus, (2) that each locus is statistically independent of the others so that the multiplication across loci is justified, and (3) that the only level of population substructure that is important for DNA typing is that of race.
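The multiplication in Equation 4.8 can be sketched in a few lines. This is only an illustration under the stated assumptions of HWE and linkage equilibrium; the allele frequencies and the three-locus profile below are hypothetical, and the function names are not from the text.

```python
from math import prod


def locus_frequency(allele_freqs):
    """Expected HWE frequency of the genotype observed at one locus.

    allele_freqs is (p,) for a single-band locus presumed homozygous,
    or (p_i, p_j) for a two-band heterozygous locus.
    """
    if len(allele_freqs) == 1:
        (p,) = allele_freqs
        return p * p
    p_i, p_j = allele_freqs
    return 2 * p_i * p_j


def match_probability(profile):
    """Multiply per-locus frequencies across loci (assumes linkage equilibrium)."""
    return prod(locus_frequency(locus) for locus in profile)


# Hypothetical profile: one homozygous locus and two heterozygous loci.
profile = [(0.10,), (0.05, 0.20), (0.08, 0.12)]
print(match_probability(profile))  # 0.01 * 0.02 * 0.0192
```

Each added locus multiplies in another factor smaller than 1, which is why the expected frequency of a multilocus DNA type drops so quickly as loci are added.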
Critics of the multiplication rule argued that genetically important subpopulations need not coincide with racial designations. For example, the term "Hispanic" includes a mixture of different subpopulations with variable amounts of Spanish, native American Indian, and African ancestry. Similarly, there are potentially important differences in allele frequency among Caucasian populations (for example, Finnish people versus Italians) and among black populations (for example, blacks from Africa versus blacks from Trinidad). Furthermore, if the allele frequencies of different VNTRs differ among subpopulations, then the loci are not statistically independent (even if they are genetically unlinked), and so the multiplication across loci is unjustified. Because of population substructure, DNA matches across multiple VNTRs could be more common among people within a particular ethnic group than among people drawn at random from the population as a whole, and so calculations of genotype frequency should be based on the ethnic group of the accused person and not on the race as a whole.

Figure 4.6 An example of DNA typing. Suspect S2 matches evidence samples in seven rape cases (U1-U7) for each of nine VNTR loci (D1S7, D2S44, D4S139, and so forth). Suspects S1 and S3 do not match and are excluded. The lanes labeled M contain molecular-weight markers. (Courtesy of Steven L. Redding, Office of the Hennepin County District Attorney, Minneapolis, and Lowell C. Van Berkom and Carla J. Finis, Minnesota Bureau of Criminal Apprehension.)
On the other side, defenders of the multiplication rule argued that population substructure would have a relatively minor effect on the final outcome of the calculation and that what matters most is not a high degree of accuracy but rather a general sense of whether a particular multilocus genotype is rare or common. After much acrimony in the scientific community and in courts of law, a panel of the National Research Council (NRC 1992) recommended a compromise called the ceiling principle in which a modified multiplication procedure was adopted using, for each allele frequency, a "ceiling" equal to the larger of either 0.10 or the upper 95% confidence limit of the highest frequency of the allele observed among at least three racial databases.

Even this recommendation proved controversial because some population geneticists regarded the compromise formula as too conservative. Continuing controversy prompted the formation of a second panel of the National Research Council (NRC 1996), which recommended the use of a modified product rule that takes moderate population substructure into account. According to this recommendation, in most cases the match probability may be calculated according to the left-hand side of the following:

$$\prod_{\text{homozygous loci}} 2p_i \;\times \prod_{\text{heterozygous loci}} 2p_ip_j \;>\; \prod_{\text{homozygous loci}} \left[p_i^2 + p_i(1 - p_i)F_{ST}\right] \;\times \prod_{\text{heterozygous loci}} 2p_ip_j(1 - F_{ST})$$

In this expression, $p_i$ and $p_j$ have the same meaning as in Equation 4.8, and $F_{ST}$ is the fixation index among the subpopulations in the larger whole (typically a major racial group). The use of the calculation is justified by the inequality. Each factor on the right-hand side of this inequality is the per-locus genotype frequency calculated from Equation 4.7, which takes $F_{ST}$ into account. The left-hand side is greater than the right-hand side because, for each homozygous locus, it can be shown that $2p_i > p_i^2 + p_i(1 - p_i)F_{ST}$; and, for each heterozygous locus, it is clear that $2p_ip_j > 2p_ip_j - 2p_ip_jF_{ST}$ because $F_{ST} > 0$.
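The inequality behind the modified product rule can be checked numerically. The sketch below compares the conservative left-hand side with the substructure-adjusted right-hand side for a hypothetical three-locus profile and an assumed value of $F_{ST}$; none of the numbers come from the NRC report.

```python
from math import prod


def conservative_factor(alleles):
    """Left-hand side factor: 2p for a single-band locus, 2*p_i*p_j otherwise."""
    if len(alleles) == 1:
        (p,) = alleles
        return 2 * p
    p_i, p_j = alleles
    return 2 * p_i * p_j


def substructure_factor(alleles, f_st):
    """Right-hand side factor: per-locus genotype frequency allowing for F_ST."""
    if len(alleles) == 1:
        (p,) = alleles
        return p * p + p * (1 - p) * f_st
    p_i, p_j = alleles
    return 2 * p_i * p_j * (1 - f_st)


# Hypothetical profile and fixation index (illustrative values only).
profile = [(0.10,), (0.05, 0.20), (0.08, 0.12)]
f_st = 0.05

lhs = prod(conservative_factor(locus) for locus in profile)
rhs = prod(substructure_factor(locus, f_st) for locus in profile)
print(lhs, rhs, lhs > rhs)
```

Because every left-hand factor exceeds its right-hand counterpart whenever $F_{ST} > 0$, the reported match probability errs on the side of being too large, which favors the suspect.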
Equally as important as the calculation itself, the committee emphasized, was the principle that no probability value should be cited unless accompanied by an appropriate 95% confidence interval to indicate its degree of reliability. The 1996 report also enumerated a number of special situations in which alternative formulas are required because of population substructure or inbreeding.

INBREEDING

When matings take place between relatives, the pattern of mating is called inbreeding. In human beings, the closest degree of inbreeding usually encountered in most societies is first-cousin mating. Many plants regularly undergo self-fertilization, and some insects regularly practice brother-sister mating. Inbreeding need not unite close relatives, however. As we shall see, a certain level of inbreeding is inescapable in small subpopulations because the members of a subpopulation typically share recent or remote common ancestors. The common ancestry between mating pairs constitutes inbreeding. Hence, the genetic differentiation among subpopulations described by the hierarchical F statistics can be interpreted as a sort of inbreeding effect resulting from population substructure. The relationship between population substructure and inbreeding is a subtle one, but it has profound consequences in population genetics.

Genotype Frequencies with Inbreeding

The main effect of population substructure is a decrease in average heterozygosity among subpopulations, relative to the heterozygosity expected with random mating in a hypothetical total population. Likewise, the main effect of inbreeding is to produce organisms with a decrease in heterozygosity, relative to the heterozygosity expected with random mating in the same subpopulation. The decrease in heterozygosity due to inbreeding can be illustrated with the example of repeated self-fertilization.
Consider a self-fertilizing population of plants that consists of 1/4 AA, 1/2 Aa, and 1/4 aa genotypes, which are in Hardy-Weinberg proportions. Because each plant undergoes self-fertilization, the AA and aa genotypes produce only AA and aa offspring, respectively, and the Aa genotypes produce 1/4 AA, 1/2 Aa, and 1/4 aa offspring. After one generation of self-fertilization, therefore, the genotype frequencies of AA, Aa, and aa are:

AA: 1/4 × 1 + 1/2 × 1/4 = 3/8
Aa: 1/2 × 1/2 = 1/4
aa: 1/4 × 1 + 1/2 × 1/4 = 3/8

These genotype frequencies are no longer in Hardy-Weinberg proportions. There is a deficiency of heterozygous genotypes and an excess of homozygous genotypes. After a second generation of self-fertilization, the genotype frequencies are 7/16 AA, 2/16 Aa, and 7/16 aa, which have an even greater deficiency of heterozygotes. Note, however, that the allele frequency of A remains constant. Denoting the allele frequency of A as p, then:

In the initial population: p = 1/4 + 1/2 × 1/2 = 1/2
After one generation of selfing: p = 3/8 + 1/2 × 2/8 = 1/2
After two generations of selfing: p = 7/16 + 1/2 × 2/16 = 1/2

The example of self-fertilization illustrates the general principle that inbreeding, by itself, does not change the allele frequency. One assumption required for constant allele frequencies under inbreeding is that all genotypes must have an equal likelihood of survival and reproduction, which is to say that no natural selection takes place. If there is selection, then the allele frequencies can change with inbreeding (or, for that matter, with any mating system).

The effects of inbreeding can be made quantitative by comparing the proportion of heterozygous genotypes among inbred organisms with the proportion of heterozygous genotypes expected with random mating.
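The self-fertilization example can be iterated mechanically. The following minimal sketch assumes complete selfing and no selection, as in the text; exact fractions are used so the results can be compared directly with the values 3/8, 1/4, 3/8 and 7/16, 2/16, 7/16. The function names are illustrative.

```python
from fractions import Fraction


def self_fertilize(freqs):
    """One generation of complete selfing; freqs = (AA, Aa, aa)."""
    hom_a, het, hom_b = freqs
    quarter = Fraction(1, 4)
    # Homozygotes breed true; heterozygotes yield 1/4 AA, 1/2 Aa, 1/4 aa.
    return (hom_a + quarter * het, het / 2, hom_b + quarter * het)


def allele_frequency(freqs):
    """Frequency of allele A: all of AA plus half of Aa."""
    return freqs[0] + freqs[1] / 2


gen0 = (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4))
gen1 = self_fertilize(gen0)
gen2 = self_fertilize(gen1)

for label, gen in (("initial", gen0), ("one generation", gen1), ("two generations", gen2)):
    print(label, gen, "p =", allele_frequency(gen))
```

Heterozygosity halves each generation while the allele frequency stays at 1/2, which is the pattern the example is meant to show.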
To be precise, consider a gene with two alleles, A and a, at respective frequencies p and q (with p + q = 1). Suppose that the frequency of heterozygous genotypes in a subpopulation of inbred organisms is some quantity $H_I$. Were the subpopulation undergoing random mating, the HWE frequency of heterozygous genotypes would be 2pq. However, for the sake of generality, we will denote the random-mating heterozygosity by the symbol $H_0$. The effects of inbreeding can be defined as the proportionate reduction in heterozygosity relative to random mating. This value is expressed mathematically as $(H_0 - H_I)/H_0$; this ratio is usually denoted by the symbol F, which is called the inbreeding coefficient. At this point, the use of F for the inbreeding coefficient may seem a poor choice in view of the use of $F_{ST}$ and related symbols for measuring the effects of population substructure, but we will see in a few moments that inbreeding and population substructure are intimately related. Thus we define

$$F = \frac{H_0 - H_I}{H_0} \qquad (4.9)$$

In biological terms, F measures the fractional reduction in heterozygosity of an inbred subpopulation relative to a random-mating subpopulation with the same allele frequencies. Because $H_0 = 2pq$, the frequency of heterozygous genotypes in the inbred subpopulation can be written in terms of F as $H_I = H_0 - H_0F = H_0(1 - F) = 2pq(1 - F)$.

The frequency of AA homozygous genotypes in an inbred subpopulation can also be expressed in terms of F. Suppose that the proportion of AA genotypes is denoted P. Because the allele frequency of A is p, we must have, by Equation 4.9, that $P + H_I/2 = p$. But $H_I = 2pq(1 - F)$, and so $P = p - 2pq(1 - F)/2$.

PROBLEM 4.5 Use the relation $P = p - 2pq(1 - F)/2$ and the fact that p + q = 1 to show that $P = p^2 + pqF$. Show also that P can be written as $P = p^2(1 - F) + pF$.

ANSWER $P = p - 2pq(1 - F)/2 = p - pq(1 - F) = p - pq + pqF = p(1 - q) + pqF = p^2 + pqF$. This establishes the first identity.
Then, substituting for q in the second term, $P = p^2 + p(1 - p)F = p^2 + pF - p^2F = p^2(1 - F) + pF$.

Problem 4.5 shows that the frequency of AA genotypes in an inbred subpopulation equals $p^2(1 - F) + pF$. In a similar manner, it can be shown that the frequency of aa genotypes is $q^2(1 - F) + qF$. In summary, in a subpopulation of organisms with inbreeding coefficient F, the genotype frequencies are expected in the proportions:

$$\begin{aligned} AA &: \; p^2(1 - F) + pF = p^2 + pqF \\ Aa &: \; 2pq(1 - F) = 2pq - 2pqF \\ aa &: \; q^2(1 - F) + qF = q^2 + pqF \end{aligned} \qquad (4.10)$$

The expressions at the far right in Equation 4.10 facilitate comparison of the genotype frequencies expected with inbreeding relative to those expected with HWE. With inbreeding, there is a deficiency of heterozygotes equal to 2pqF and an excess of each homozygous class equal to half the deficiency of heterozygotes. The biological reason that the missing heterozygotes are allocated equally to the two homozygous classes is that each heterozygous genotype contains one A and one a allele. Notice that when there is no inbreeding (F = 0), the genotype frequencies are in the familiar Hardy-Weinberg proportions; with complete inbreeding (F = 1), the inbred subpopulation consists entirely of AA and aa homozygotes in the frequencies p and q, respectively.

If a gene has multiple alleles $A_1, A_2, \ldots, A_n$ at respective frequencies $p_1, p_2, \ldots, p_n$ (with $p_1 + p_2 + \cdots + p_n = 1$), then in a population with inbreeding coefficient F, the frequencies of $A_iA_i$ homozygotes and $A_iA_j$ heterozygotes are as follows:

$$\begin{aligned} A_iA_i &: \; p_i^2(1 - F) + p_iF \\ A_iA_j &: \; 2p_ip_j(1 - F) \end{aligned} \qquad (4.11)$$

We are now in a position to apply Equations 4.10 and 4.11 to real data.

PROBLEM 4.6 Plants able to undergo self-fertilization are said to be self-compatible.
In a population of self-compatible plants, if each plant undergoes self-fertilization a fraction s of the time and otherwise mates randomly, then it can be shown (Crow and Kimura 1970; Hedrick and Cockerham 1986) that F very quickly attains the value F = s/(2 - s). Phlox cuspidata is self-compatible, and for this species the amount of self-fertilization is estimated at approximately s = 0.78 (Levin 1978). From s we can predict the inbreeding coefficient as F = 0.78/(2 - 0.78) = 0.64. In a Texas population of P. cuspidata, Levin (1978) found two electrophoretic alleles of the phosphoglucomutase-2 gene, designated $Pgm\text{-}2^a$ and $Pgm\text{-}2^b$. In a sample of 35 plants, there were 15 $Pgm\text{-}2^a/Pgm\text{-}2^a$, 6 $Pgm\text{-}2^a/Pgm\text{-}2^b$, and 14 $Pgm\text{-}2^b/Pgm\text{-}2^b$ genotypes. Are these numbers consistent with the estimate F = 0.64? (Note: The $\chi^2$ in this case has one degree of freedom because only the allele frequency is estimated from the data; if F also were estimated from the data, rather than being calculated independently from the degree of self-fertilization, then there would be zero degrees of freedom and no goodness-of-fit test would be possible.)

ANSWER The allele frequencies of $Pgm\text{-}2^a$ and $Pgm\text{-}2^b$ are estimated as (30 + 6)/70 = 0.514 and 1 - 0.514 = 0.486, respectively. The hypothesis is that F = 0.64, and so 1 - F = 0.36. The expected numbers of the genotypes aa, ab, and bb are, respectively, [(0.514)^2(0.36) + (0.514)(0.64)](35) = 14.8, [2(0.514)(0.486)(0.36)](35) = 6.3, and [(0.486)^2(0.36) + (0.486)(0.64)](35) = 13.9. With these expectations, the $\chi^2$ = 0.02 with one degree of freedom, and the associated probability is about 0.96.
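The arithmetic in Problem 4.6 can be reproduced as follows. This is a sketch of the goodness-of-fit calculation only, using the rounded value F = 0.64 from the text and Equation 4.10 for the expected proportions; the variable and function names are illustrative.

```python
def expected_counts(p, f, n):
    """Expected numbers of aa, ab, bb genotypes under Equation 4.10."""
    q = 1 - p
    return (
        (p * p * (1 - f) + p * f) * n,
        (2 * p * q * (1 - f)) * n,
        (q * q * (1 - f) + q * f) * n,
    )


def chi_square(observed, expected):
    """Pearson chi-square statistic summed over genotype classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = (15, 6, 14)                           # counts reported by Levin (1978)
n = sum(observed)                                # 35 plants
p = (2 * observed[0] + observed[1]) / (2 * n)    # (30 + 6)/70
f = 0.64                                         # predicted from s = 0.78

expected = expected_counts(p, f, n)
statistic = chi_square(observed, expected)
print([round(e, 1) for e in expected], round(statistic, 2))
```

Carrying the unrounded allele frequency through the calculation shifts the expected counts slightly in the first decimal place relative to the hand calculation, but the chi-square statistic still rounds to 0.02.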
The fit to the inbreeding model is excellent.

PROBLEM 4.7 Assuming that F = 0.64 in Texas populations of Phlox cuspidata, calculate the genotype frequencies expected from the four alleles of the gene Adh coding for alcohol dehydrogenase by using the allele frequencies 0.11 (Adh-1), 0.84 (Adh-2), 0.01 (Adh-3), and 0.04 (Adh-4) from Problem 3.10 in Chapter 3.

ANSWER Using the expressions in Equation 4.11, the expected frequencies are: Adh-1/Adh-1 = 0.0748, Adh-1/Adh-2 = 0.0665, Adh-2/Adh-2 = 0.7916, Adh-1/Adh-3 = 0.0008, Adh-2/Adh-3 = 0.0060, Adh-3/Adh-3 = 0.0064, Adh-1/Adh-4 = 0.0032, Adh-2/Adh-4 = 0.0242, Adh-3/Adh-4 = 0.0003, Adh-4/Adh-4 = 0.0262.

Relation Between the Inbreeding Coefficient and the F Statistics

There is an intimate relation between the inbreeding coefficient F and the hierarchical F statistics examined in the first section of this chapter. Each of the hierarchical F statistics is also a type of inbreeding coefficient that measures the reduction in heterozygosity at any level of a population hierarchy, relative to a higher level. The connection between the inbreeding coefficient and the F statistics is indicated by the formal similarity between Equation 4.7 and the right-hand side of Equation 4.10. To incorporate the inbreeding coefficient F from mating between relatives into the hierarchical framework, we will embellish it with the subscript IS. In words, $F_{IS}$ is the inbreeding coefficient of a group of inbred organisms relative to the subpopulation to which they belong. The value of $F_{IS}$ is the reduction in heterozygosity of the inbred organisms, and the genotype frequencies among the inbred organisms are given by Equation 4.10 with p and q equal to the allele frequencies in the relevant subpopulation. Within each subpopulation there is random mating, and so the genotype frequencies are given by the HWE.
Among the subpopulations, however, there is a reduction in average heterozygosity, relative to the total population, because mates within subpopulations often share remote common ancestors. The sharing of remote common ancestors explains the apparent paradox that inbreeding accumulates even when there is random mating within a subpopulation. The reduction in heterozygosity attributable to this type of inbreeding, relative to the total population, is measured by $F_{ST}$, and the appropriate formulas for the genotype frequencies, averaged across the subpopulations, are given in Equation 4.7, in which p and q are the average allele frequencies among the subpopulations.

A population geneticist is often interested not only in $F_{IS}$ but also in $F_{IT}$. The former is the reduction in heterozygosity of a group of inbred organisms relative to the subpopulation to which they belong; the latter is the reduction in heterozygosity of the inbred organisms relative to the total population. Hence, $F_{IT}$ is the most inclusive measure of all inbreeding. It embraces not only the effects of mating between close relatives within a subpopulation but also the accumulated inbreeding resulting from mating between remote relatives at all levels of the population hierarchy.

An expression for $F_{IT}$ is implicit in the definitions. For consistency, we will use the symbol $H_S$ to denote the heterozygosity in a particular subpopulation. Hence, Equation 4.9 defining $F_{IS}$ may be rewritten as:

$$F_{IS} = \frac{H_S - H_I}{H_S} \qquad (4.12)$$

Similarly, if we use $H_T$ to denote the heterozygosity in the total population, the analogous equation defining $F_{IT}$ is:

$$F_{IT} = \frac{H_T - H_I}{H_T} \qquad (4.13)$$

Consequently, $1 - F_{IS} = H_I/H_S$ and $1 - F_{IT} = H_I/H_T$. However, the remarks in Problem 4.1 also indicate that $1 - F_{ST} = H_S/H_T$, and so by multiplication,

$$(1 - F_{IS})(1 - F_{ST}) = 1 - F_{IT} \qquad (4.14)$$

Hence, if we know both $F_{IS}$ and $F_{ST}$, then we can obtain $F_{IT}$ from Equation 4.14.
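Equations 4.12 through 4.14 can be checked against one another with a few lines of arithmetic. In this sketch the three heterozygosities are hypothetical values chosen only for illustration, and the function names are assumptions.

```python
def fixation_indices(h_i, h_s, h_t):
    """Return (F_IS, F_ST, F_IT) from heterozygosities H_I, H_S, H_T."""
    f_is = (h_s - h_i) / h_s   # Equation 4.12
    f_st = (h_t - h_s) / h_t   # equivalent to 1 - F_ST = H_S / H_T
    f_it = (h_t - h_i) / h_t   # Equation 4.13
    return f_is, f_st, f_it


def f_it_from_components(f_is, f_st):
    """Equation 4.14 solved for F_IT."""
    return 1 - (1 - f_is) * (1 - f_st)


# Hypothetical heterozygosities: inbred organisms, subpopulation, total.
f_is, f_st, f_it = fixation_indices(h_i=0.18, h_s=0.30, h_t=0.40)
print(f_is, f_st, f_it, f_it_from_components(f_is, f_st))
```

With these values $F_{IS}$ = 0.40 and $F_{ST}$ = 0.25, and both routes give $F_{IT}$ = 0.55, since (0.60)(0.75) = 0.45.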
The value of $F_{ST}$ that results from mating between remote relatives in a subpopulation of limited size is taken up in Chapter 7. The value of $F_{IS}$ resulting from mating between close relatives within a subpopulation can be calculated from the pedigree of the inbred organisms by using an alternative probability interpretation of $F_{IS}$ defined in the next section.

The Inbreeding Coefficient as a Probability

The inbreeding coefficient $F_{IS}$ (which we will again call simply F unless the subscripts are needed for clarity) has an interpretation in terms of probability in addition to its interpretation in terms of heterozygosity spelled out in Equation 4.12. The probability interpretation is important in the calculation of F from pedigrees. To express the inbreeding coefficient in terms of probability, imagine the two alleles of a gene present in a single inbred organism. Because the organism is inbred, the parents share one or more common ancestors. The two alleles present in the inbred organism could have been derived from the same ancestral allele by DNA replication in one of the common ancestors. In this case, the alleles are said to be identical by descent (IBD), and the genotype of the inbred organism is said to be autozygous. Conversely, the alleles may not be replicas of a single ancestral allele, in which case the alleles are not identical by descent, and the genotype is said to be allozygous. The probability interpretation of the inbreeding coefficient is that F is the probability that the two alleles of a gene in an inbred organism are IBD (autozygous). Note that the concepts of autozygosity and allozygosity have nothing to do with the state of an allele (whether the allele is A or a, for example). The concepts are concerned only with common ancestry. If the alleles are replicas of a single allele in a common ancestor, they are autozygous; otherwise, they are allozygous.
Interpreted as the probability of autozygosity, the inbreeding coefficient is clearly a relative concept. F measures the probability of autozygosity relative to some ancestral subpopulation. In defining the ancestral subpopulation, we arbitrarily assume that all alleles present in the ancestral population are not identical by descent. The inbreeding coefficient of an organism in the present population is then the probability that the two alleles of a gene in the inbred organism arose by replication of a single allele more recently than the time at which the ancestral population existed. The ancestral population need not be remote in time from the present one. Indeed, the ancestral population, usually presumed to be noninbred ($F_{IS} = 0$), typically refers to the population existing just a few generations previous to the present one, and $F_{IS}$ in the present population then measures inbreeding that has accumulated in the span of these few generations. (Technically, any prior inbreeding is allocated to $F_{ST}$.) Because the span of time is usually short, the possibility of mutation can safely be ignored. Autozygous genotypes must therefore be homozygous for some allele of the gene under consideration. On the other hand, allozygous genotypes can be either homozygous or heterozygous.

Figure 4.7 In a genotype that is autozygous, homologous alleles are derived from a single DNA sequence in an ancestor, and they are therefore identical by descent. In an allozygous genotype, homologous alleles are not identical by descent. As shown here, allozygous genotypes may be heterozygous or homozygous, but autozygous genotypes must be homozygous (except in the unlikely event that one allele has mutated). (Panel labels: alleles in ancestral population, all presumed to be not identical by descent; genotypes in present population: autozygous and homozygous, allozygous and homozygous, allozygous and heterozygous.)
Figure 4.7 illustrates how the concepts of autozygosity and allozygosity are related to those of homozygosity and heterozygosity. The essential point is that two alleles can be identical by state (IBS), which means that they have the same sequence of nucleotides along the DNA, without being identical by descent. The concept of identity by descent pertains to the ancestral origin of an allele and not to its chemical makeup. Although, as shown in Figure 4.7, two distinct alleles that are identical by state (for example, two $A_1$ alleles or two $A_2$ alleles) may come together in fertilization and thereby make the inbred organism homozygous, the alleles in the ancestral population are, by definition, not identical by descent, and so the genotype is allozygous. Similarly, although a heterozygous genotype must be allozygous (ignoring mutation), a homozygous genotype may be either autozygous or allozygous (see Figure 4.7).

The probability interpretation of the inbreeding coefficient results in the same expected genotype frequencies as the heterozygosity interpretation set out in Equation 4.10. To verify the equivalence, we need only consider the implications of the probability definition for a subpopulation of inbred organisms. For this purpose, imagine a subpopulation in which the organisms have average inbreeding coefficient F. Consider the alleles of a gene present in any one of the inbred organisms. Either of two things must be true: the alleles must either be allozygous (probability 1 - F) or be autozygous (probability F).
If the alleles are allozygous, then the probability that the chosen organism has any particular genotype is simply the probability of that genotype in a random-mating population, because, by chance, the inbreeding has not affected this particular gene. On the other hand, if the alleles are autozygous, then the chosen organism must be homozygous, and the probability of homozygosity for any particular allele is simply the frequency of the allele in the subpopulation as a whole. (Because the alleles in question are autozygous, knowing which allele is present in one chromosome immediately tells you that an identical allele is in the homologous chromosome.) These considerations hold regardless of the number of alleles but, to simplify matters, suppose there are only two alleles A and a at frequencies p and q (with p + q = 1). The probability that an organism has genotype AA is therefore $p^2(1 - F) + pF$. In this expression, the first term refers to cases in which the alleles are allozygous and the second to cases in which the alleles are autozygous. Similarly, the probability that an organism has genotype aa is $q^2(1 - F) + qF$. Heterozygous Aa genotypes then have the frequency $2pq(1 - F)$, since alleles that are heterozygous must be allozygous.

The genotype frequencies with inbreeding are summarized graphically in Figure 4.8. The box is divided vertically into two parts, corresponding to genes whose alleles remain allozygous in spite of the inbreeding and those whose alleles are autozygous because of the inbreeding. The division is in the proportion 1 - F : F. Within the allozygous part of the box, the horizontal panels correspond to the allozygous genotypes AA, Aa, and aa, which are in the Hardy-Weinberg frequencies. Within the autozygous part of the box, the horizontal panels correspond to the autozygous genotypes AA and aa, which are in the proportions p : q. The formulas for the genotype frequencies with inbreeding are given in Table 4.3.
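The two-part argument (allozygous with probability 1 - F, autozygous with probability F) extends directly to any number of alleles and can be sketched as below. The function name and the two-allele frequencies are illustrative assumptions; the four Adh allele frequencies and F = 0.64 are the ones quoted in Problem 4.7.

```python
from itertools import combinations_with_replacement


def inbred_genotype_frequencies(allele_freqs, f):
    """Genotype frequencies as a mixture of allozygous and autozygous cases.

    allele_freqs maps allele name -> frequency; f is the inbreeding coefficient.
    """
    freqs = {}
    pairs = combinations_with_replacement(allele_freqs.items(), 2)
    for (a, p_a), (b, p_b) in pairs:
        if a == b:
            # Allozygous homozygote (HWE) plus autozygous homozygote.
            freqs[(a, b)] = (1 - f) * p_a * p_a + f * p_a
        else:
            # Heterozygotes can only be allozygous.
            freqs[(a, b)] = (1 - f) * 2 * p_a * p_b
    return freqs


# Two alleles with hypothetical frequencies: compare with p^2 + pqF.
two = inbred_genotype_frequencies({"A": 0.3, "a": 0.7}, 0.25)
print(two)

# Four Adh alleles with the frequencies and F used in Problem 4.7.
adh = inbred_genotype_frequencies(
    {"Adh-1": 0.11, "Adh-2": 0.84, "Adh-3": 0.01, "Adh-4": 0.04}, 0.64
)
for genotype, freq in adh.items():
    print(genotype, round(freq, 4))
```

In the two-allele case the mixture reproduces the right-hand forms of Equation 4.10, and in every case the genotype frequencies sum to 1.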
Note that the genotype frequencies are exactly the same as those given in Equations 4.10. This result shows that the autozygosity definition of F and the heterozygosity definition of F, though superficially quite different, are actually equivalent.

Figure 4.8 Graphical representation of the effects of inbreeding on genotype frequencies. Some genes remain allozygous in spite of the inbreeding, and among these the genotype frequencies of AA, Aa, and aa are given by the Hardy-Weinberg principle. Other genes are autozygous because of the inbreeding, and among these the genotype frequencies of AA and aa are given by the allele frequencies. There are no heterozygotes in the autozygous case because the two alleles present at an autozygous locus are, by definition, identical by descent. (Panel labels: probability that a gene remains allozygous in spite of inbreeding, 1 - F, with panels proportional to Hardy-Weinberg frequencies; probability that a gene becomes autozygous because of inbreeding, F, with panels proportional to the allele frequencies.)

Corresponding to the probability interpretation of $F_{IS}$, there is also a probability interpretation of $F_{ST}$. However, the comparison is not between homologous alleles in the same organism but between homologous alleles drawn at random from the same subpopulation. Specifically, $F_{ST}$ is the probability of IBD between two alleles drawn at random from the same subpopulation. However, the inbreeding at this level is not realized as a departure from HWE but rather as differences in allele frequency among the subpopulations (Equation 4.6). The variance in allele frequency, in turn, results in a departure
from HWE in the genotype frequencies when averaged across subpopulations (Equation 4.7). The probability interpretation of $F_{ST}$ makes the meaning of Equation 4.14 transparent. It says that, in the total population, a pair of alleles will escape being IBD ($1 - F_{IT}$) only if they escape the effects of mating between close relatives ($1 - F_{IS}$) and, independently, if they escape the cumulative inbreeding effects of mating between remote relatives due to population substructure ($1 - F_{ST}$).

TABLE 4.3 GENOTYPE FREQUENCIES WITH INBREEDING

Frequency in population:

Genotype    With inbreeding coefficient F                     With F = 0          With F = 1
            (allozygous genes + autozygous genes)             (random mating)     (complete inbreeding)
AA          $p^2(1 - F) + pF$                                 $p^2$               $p$
Aa          $2pq(1 - F)$                                      $2pq$               $0$
aa          $q^2(1 - F) + qF$                                 $q^2$               $q$

Genetic Effects of Inbreeding

In outcrossing species, which means species that regularly avoid inbreeding, close inbreeding is generally harmful. The effects are seen most dramatically when inbreeding is complete or nearly complete. Although nearly complete autozygosity can be approached in most species by many generations of brother-sister mating, autozygosity of entire chromosomes can easily be accomplished in Drosophila by the sort of mating scheme shown in Figure 4.9. In this diagram, Cy (Curly wings) and Pm (Plum-colored eyes) are dominant mutations present in certain laboratory second chromosomes that carry several long inversions to prevent recombination. In step A, a wildtype fly is mated with Cy/Pm; four genotypes of offspring are produced because the wildtype fly is heterozygous for two different wildtype chromosomes. From each cross in A, a single Cy son is chosen and mated with Cy/Pm. This step is shown in part B. Three classes of progeny are produced (because Cy/Cy is lethal); moreover, from each mating the Cy/+ progeny all carry wildtype second chromosomes that are IBD because they originated by replication of a single chromosome in the previous generation.
In the cross in part C, the Cy/+ progeny from part B are mated among themselves; the expected progeny are +/+ and Cy/+ in the ratio 1/3 : 2/3, and the wildtype homozygotes have second chromosomes that are IBD. For chromosome 2, these flies are completely inbred. In the mating D, Cy/+ flies carrying two different wildtype chromosomes are crossed; again the expected progeny are +/+ and Cy/+ in the ratio 1/3 : 2/3, but in this case the wildtype flies are heterozygous for different copies of chromosome 2 and are not completely inbred.

For the matings in part C and part D, an estimate v of the viability (ability to survive) of the +/+ genotype, relative to that of the Cy/+ genotype, is given by

$$v = \frac{2 \times \text{Number}(+/+)}{1 + \text{Number}(Cy/+)} \tag{4.15}$$

where Number (+/+) and Number (Cy/+) are the counts of wildtype and Curly offspring, respectively (Haldane 1956). The addition of 1 to the denominator makes the estimate of v almost unbiased. When the total number of offspring is large, v is essentially equal to two times the number of wildtype offspring divided by the number of Curly offspring.

Results of an experiment using the procedure in Figure 4.9 are shown in Figure 4.10. It is evident that the homozygous genotypes (shaded histogram) are relatively poor in viability. In fact, about 37% of the homozygotes are lethal. Moreover, among the homozygotes that have viabilities within the normal range of heterozygotes (open histogram), virtually all can be shown to have reduced fertility (Sved 1975; Simmons and Crow 1977).
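The viability estimate in Equation 4.15 can be sketched in a few lines of Python. This is only an illustration: the function name `relative_viability` and all of the offspring counts are assumptions made up for the example, not data from the experiment in Figure 4.10.

```python
def relative_viability(n_wildtype: int, n_curly: int) -> float:
    """Estimate the viability of +/+ relative to Cy/+ (Equation 4.15).

    With equal viabilities the surviving offspring are expected in the
    ratio 1/3 (+/+) : 2/3 (Cy/+), so the factor of 2 rescales that 1:2
    expectation to v = 1. Adding 1 to the denominator follows the text,
    which states that it makes the estimate almost unbiased.
    """
    if n_wildtype < 0 or n_curly < 0:
        raise ValueError("offspring counts must be non-negative")
    return 2 * n_wildtype / (1 + n_curly)


# Hypothetical (wildtype, Curly) offspring counts, invented for illustration.
example_crosses = {
    "near the 1/3 : 2/3 expectation": (100, 199),
    "reduced viability": (50, 200),
    "lethal when homozygous": (0, 300),
}

for label, (n_wildtype, n_curly) in example_crosses.items():
    v = relative_viability(n_wildtype, n_curly)
    print(f"{label}: v = {v:.3f}")
```

A chromosome that is lethal when homozygous yields no wildtype offspring and therefore v = 0, whichever cross (part C or part D) supplied the counts.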
Inbreeding so close as to make entire chromosomes homozygous is rare in outcrossing species, except in the kind of experiment in Figure 4.9, but the effects are clearly very harmful and provide a new dimension of genetic diversity. In the case of allozymes, genetic diversity results from common alleles that do not perceptibly impair viability or fertility when homozygous. In the case of inbreeding, the effects are mainly due to rare alleles that are severely detrimental when homozygous. (The fact that the alleles are rare is shown by the small proportion of lethal or near-lethal heterozygotes.) Figure 4.10 shows that natural populations of Drosophila contain considerable hidden genetic variation in the form of rare deleterious recessive alleles.

Figure 4.9 Mating scheme to extract wildtype chromosomes (in this case, the second chromosome) from populations of Drosophila melanogaster. Cy (Curly wings) and Pm (Plum eye color) are dominant mutations contained in certain special laboratory chromosomes that have multiple inversions to prevent recombination. From each mating of the type in part A, a single Cy son (containing one wildtype second chromosome) is selected. This son is backcrossed (part B) in order to reproduce many replicas of the second chromosome; the Cy progeny are selected for further mating, and the other progeny are discarded. Brother-sister mating as in part C is expected to produce 1/4 Cy/Cy, 1/2 Cy/+, and 1/4 +/+ zygotes (where + denotes the wildtype second chromosome); the Cy/Cy zygotes do not survive, and so the surviving offspring are 2/3 Cy/+ (Curly wings) and 1/3 +/+ (wildtype straight wings). Matings as in part D, between a female containing one wildtype second chromosome and a male carrying a different one, are also expected to produce 2/3 Curly-winged and 1/3 straight-winged progeny. However, in mating C, the straight-winged flies are homozygous for a single wildtype second chromosome, whereas in mating D, the straight-winged flies are heterozygous for two different wildtype second chromosomes. Panels: (A) Mate and select single Curly-winged son. (B) Backcross a single Cy male from (A) and select Curly sons and daughters, which are heterozygous. (C) Mate heterozygotes for same wildtype chromosome and count proportion of non-Curly offspring (expect 1/3 non-Cy : 2/3 Cy). (D) Mate heterozygotes for different chromosomes and count proportion of non-Curly offspring (expect 1/3 non-Cy : 2/3 Cy).

Figure 4.10 Viability distributions of wildtype homozygotes (shaded area) and wildtype heterozygotes (black outline) of second chromosomes extracted from Drosophila melanogaster according to the mating scheme in Figure 4.9. The horizontal axis is viability relative to Cy/+. The histograms depict results of testing 691 homozygous combinations and 688 heterozygous combinations. Note that, in this sample, nearly 37% of the wildtype chromosomes are lethal when homozygous, and many more have viabilities substantially below normal. (Data from Mukai et al. 1974.)

Detrimental effects of inbreeding, called inbreeding depression, are found in virtually all outcrossing species, and the more intense the inbreeding, the more harmful the effects. Inbreeding in human beings is also generally harmful, but the effect is difficult to measure because the degree of inbreeding is less than that in experimental organisms; the effects may also vary from population to population. Nevertheless, children of first-cousin matings are, on the average, less capable than noninbred children in any number of ways (for example, higher rate of mortality, lower IQ scores), although it should be emphasized that many such children are within the normal range of abilities and some are quite gifted. As in most organisms, inbreeding depression is largely due to the increased homozygosity of rare recessive alleles, and so inbreeding effects in human beings are seen most dramatically in the increased frequency of genetic abnormalities due to harmful recessive alleles among the children of first-cousin matings. The increased frequency of such conditions results from the genotype frequencies given in Table 4.3. If a denotes a rare deleterious recessive allele, then, among the children of first-cousin matings, the frequency of aa is $q^2(1 - \tfrac{1}{16}) + q(\tfrac{1}{16})$ because, for these children, $F = \tfrac{1}{16}$, as will be shown in the next section. On the other hand, with random mating, the frequency of recessive homozygotes is $q^2$. Thus, the risk of an affected offspring from a first-cousin mating relative to that from a mating of nonrelatives is given by

$$\frac{q^2(1 - \tfrac{1}{16}) + q(\tfrac{1}{16})}{q^2} = 0.9375 + \frac{0.0625}{q} \tag{4.16}$$

For example, when q = 0.01, the increased risk is approximately 7; that is, a first-cousin mating has seven times the chance of producing a homozygous recessive child as compared to a mating between nonrelatives when the frequency of the harmful recessive allele is 0.01. There is clearly a dramatic inbreeding effect, and the rarer the deleterious recessive allele, the greater the effect.

PROBLEM 4.8 Relative to the risk with random mating, calculate the risk of a homozygous recessive offspring from a mating of second cousins ($F = \tfrac{1}{64}$) when the recessive allele frequency is q = 0.01.

ANSWER In general, the relative risk is given by $[q^2(1 - F) + qF]/q^2 = (1 - F) + F/q$. For $F = \tfrac{1}{64}$, this becomes $0.9844 + 0.0156/q$, and the value for q = 0.01 is approximately 2.5.
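The relative-risk calculation in Equation 4.16 and Problem 4.8 can be checked numerically. The following is a minimal sketch, assuming only the genotype frequency $q^2(1 - F) + qF$ from Table 4.3; the function names are hypothetical, and exact fractions are used so that no rounding enters before the final print.

```python
from fractions import Fraction


def recessive_homozygote_frequency(q, F):
    """Frequency of aa among offspring with inbreeding coefficient F."""
    return q * q * (1 - F) + q * F


def relative_risk(q, F):
    """Risk of an aa child relative to random mating: (1 - F) + F/q."""
    return recessive_homozygote_frequency(q, F) / (q * q)


q = Fraction(1, 100)  # allele frequency used in the worked examples
matings = [
    ("first cousins", Fraction(1, 16)),
    ("second cousins", Fraction(1, 64)),
]
for label, F in matings:
    print(f"{label}: relative risk = {float(relative_risk(q, F)):.2f}")
```

For q = 0.01 this prints 7.19 for first cousins and 2.55 for second cousins, in line with the approximate values of 7 and 2.5 quoted above. Passing a smaller q shows the relative risk growing as the allele becomes rarer.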
Calculation of the Inbreeding Coefficient from Pedigrees

Computation of F from a pedigree is simplified by drawing the pedigree in the form shown in Figure 4.11A, where the lines represent gametes contributed by parents to their offspring. The same pedigree is shown in conventional form in Figure 4.11B. The organisms in gray in part B are not represented in part A because they have no ancestors in common and therefore do not contribute to the inbreeding of the organism denoted I. The inbreeding coefficient $F_I$ of I is the probability that I is autozygous for the alleles of an autosomal gene under consideration.

Figure 4.11 (A) Convenient way to represent pedigrees for calculation of the inbreeding coefficient. In this case, the pedigree shows a mating between half-first cousins. (B) Conventional representation of the same pedigree as in part A. Squares represent males, circles represent females, and the shaded organisms in part B are not depicted in part A because they do not contribute to the inbreeding of the inbred organism designated I.

The first step in calculating $F_I$ is to locate all the common ancestors in the pedigree, because an allele could become autozygous in I only if it were inherited through both of I's parents from a common ancestor; in this case, there is only one common ancestor, namely, A. The next step in calculating $F_I$, which is carried out for each common ancestor in turn, is to trace all the paths of gametes that lead from one of I's parents back to the common ancestor and then down again to the other parent of I. These paths are the paths along which an allele in a common ancestor could become autozygous in I. In Figure 4.11A, there is only one such path: DBACE, in which the common ancestor A is underlined for bookkeeping purposes, an especially useful procedure in complex pedigrees.
The third step in calculating $F_I$ is to calculate the probability of autozygosity in I due to each of the paths in turn. For the path DBACE, the reasoning is illustrated in Figure 4.12. Here the black dots represent alleles transmitted along the gametic paths, and the number associated with each step is the probability of identity by descent of the alleles indicated. For all steps except that around the common ancestor, the probability is 1/2 because, with Mendelian segregation, the probability that a particular allele present in a parent is transmitted to a specified offspring is 1/2. To understand why $\tfrac{1}{2}(1 + F_A)$ is the probability associated with the loop around the common ancestor, denote the alleles in the common ancestor as $\alpha_1$ and $\alpha_2$. These symbols are used to avoid confusion with conventional allele symbols designating functional types of alleles, such as A for dominant and a for recessive. The pair of gametes contributed by A could contain $\alpha_1\alpha_1$, $\alpha_2\alpha_2$, $\alpha_1\alpha_2$, or $\alpha_2\alpha_1$, each with a probability of 1/4 because of Mendelian segregation. In the first two cases, the alleles are clearly identical by descent; in the second two cases, the alleles are identical by descent only if $\alpha_1$ and $\alpha_2$ are already identical by descent, which means that A is autozygous. The probability that A is autozygous is, by definition, the inbreeding coefficient of A, $F_A$. Hence, the probability for the step around the common ancestor A is

$$\tfrac{1}{4} + \tfrac{1}{4} + \tfrac{1}{4}F_A + \tfrac{1}{4}F_A = \tfrac{1}{2} + \tfrac{1}{2}F_A = \tfrac{1}{2}(1 + F_A)$$

Figure 4.12 Loops for the pedigree in Figure 4.11A, showing probabilities that designated alleles (solid dots) are identical by descent. Each loop is independent of the others, so their probabilities multiply; thus, the inbreeding coefficient of organism I is $F_I = (\tfrac{1}{2})^5(1 + F_A)$, where $F_A$ represents the inbreeding coefficient of the common ancestor.

Because each of the steps in Figure 4.12 is independent of the others, the total probability of autozygosity in I due to the path through A is $\tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{1}{2}(1 + F_A) \times \tfrac{1}{2} \times \tfrac{1}{2}$, or $(\tfrac{1}{2})^5(1 + F_A)$. Note that the exponent on the 1/2 is simply the total number of ancestors in the path. In general, if a path through a common ancestor A contains i individuals, the probability of autozygosity due to that path is

$$(\tfrac{1}{2})^i(1 + F_A)$$

Thus, the inbreeding coefficient of I in Figure 4.11A is $(\tfrac{1}{2})^5(1 + F_A)$. Assuming that A is not inbred ($F_A = 0$), the inbreeding coefficient of I reduces to $(\tfrac{1}{2})^5 = \tfrac{1}{32}$.

Figure 4.13 On the left is a pedigree of individual I, the offspring of a first-cousin mating. On the right are the two paths through common ancestors (heavy lines) used in calculating the inbreeding coefficient of I. Below each path is the contribution to $F_I$ due to that path, calculated as in Figure 4.12: path GDACE contributes $(\tfrac{1}{2})^5(1 + F_A)$, and path GDBCE contributes $(\tfrac{1}{2})^5(1 + F_B)$. Each path is mutually exclusive of the others, and so their probabilities add; thus, the total inbreeding coefficient of I is the sum of the two separate contributions. If $F_A = F_B = 0$, then $F_I = \tfrac{1}{16}$.

In pedigrees of greater complexity, there is more than one common ancestor and there may be more than one path through any of the common ancestors. The paths are mutually exclusive because autozygosity due to an allele inherited along one path excludes autozygosity due to an allele inherited along a different path. Thus, the total inbreeding coefficient is the sum of the probabilities of autozygosity due to each path considered separately. The whole procedure for calculating F is summarized in an example of a first-cousin mating in Figure 4.13. In a first-cousin mating, there are two common ancestors (A and B) and two paths (one each through A and B).
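The path rule just described, a contribution of $(\tfrac{1}{2})^i(1 + F_A)$ per path with the contributions of separate paths added together, can be sketched as follows. The representation of a pedigree as a list of (path length, ancestor inbreeding coefficient) pairs is an assumption made for this illustration, as are the function names; finding the paths in a pedigree still has to be done by hand, as in the text.

```python
from fractions import Fraction


def path_contribution(n_individuals, f_ancestor=0):
    """Autozygosity probability from one path: (1/2)**i * (1 + F_A)."""
    if n_individuals < 1:
        raise ValueError("a path must contain at least one individual")
    return Fraction(1, 2) ** n_individuals * (1 + Fraction(f_ancestor))


def inbreeding_coefficient(paths):
    """Sum the contributions of mutually exclusive paths.

    `paths` is a list of (n_individuals, f_ancestor) pairs, one pair per
    path through a common ancestor.
    """
    return sum((path_contribution(n, f) for n, f in paths), Fraction(0))


# Figure 4.11: half-first cousins, one path (DBACE) with five individuals.
half_first_cousins = [(5, 0)]
# Figure 4.13: first cousins, two paths (GDACE and GDBCE), five individuals each.
first_cousins = [(5, 0), (5, 0)]
# Hypothetical variation: the single path DBACE with an inbred ancestor, F_A = 1/4.
inbred_ancestor = [(5, Fraction(1, 4))]

print(inbreeding_coefficient(half_first_cousins))
print(inbreeding_coefficient(first_cousins))
print(inbreeding_coefficient(inbred_ancestor))
```

With noninbred ancestors the first two calls return 1/32 and 1/16, matching Figures 4.12 and 4.13. The third call illustrates how an inbred common ancestor raises the contribution of its path by the factor $(1 + F_A)$.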
The total inbreeding coefficient of I is the sum of the two separate contributions shown in Figure 4.13. If A and B are both noninbred, then $F_A = F_B = 0$, and so $F_I = (\tfrac{1}{2})^5 + (\tfrac{1}{2})^5 = \tfrac{1}{16}$; this result is the probability that I is autozygous at the specified locus. Alternatively, $F_I$ can be interpreted as the average proportion of all genes in I in which the alleles present are autozygous.

In general, for any autosomal gene, the formula for calculating the inbreeding coefficient $F_I$ of an inbred organism I is

$$F_I = \sum_A (\tfrac{1}{2})^i (1 + F_A) \tag{4.17}$$

in which the summation $\sum$ over A means summation over all possible paths through all common ancestors, i is the number of organisms in each path, and A is the common ancestor in each path.

PROBLEM 4.9 The accompanying pedigree depicts two generations of brother-sister mating. Calculate the inbreeding coefficient of I, assuming that none of the common ancestors is inbred. (Altogether, there are four common ancestors and six paths.)

ANSWER $F_I = (\tfrac{1}{2})^3(1 + F_C) + (\tfrac{1}{2})^3(1 + F_D) + (\tfrac{1}{2})^5(1 + F_A) + (\tfrac{1}{2})^5(1 + F_A) + (\tfrac{1}{2})^5(1 + F_B) + (\tfrac{1}{2})^5(1 + F_B)$. When the common ancestors are assumed to be noninbred, then $F_A = F_B = F_C = F_D = 0$, and so $F_I = \tfrac{3}{8}$.

Regular Systems of Mating

In plant and animal breeding, it is often important to know how rapidly the inbreeding coefficient increases when a strain is propagated by a regular system of mating, such as repeated self-fertilization, sib mating, or backcrossing to a standard strain. The reasoning involved in calculating the inbreeding coefficient for any generation is illustrated in Figure 4.14 for repeated self-fertilization. In this figure, the labels t - 1 and t refer to the inbred organisms after t - 1 and t generations of self-fertilization. The loop around the ancestor in generation t - 1 designates the probability that the two indicated alleles are identical by descent.
Here the formula in Equation 4.17 applies with only one path and only one ancestor in the path, and so $F_t = (\tfrac{1}{2})^1(1 + F_{t-1})$, where $F_t$ is the inbreeding coefficient in generation t. This equation is easy to solve in terms of the quantity $1 - F_t$, which is often called the panmictic index, panmixia being a synonym for random mating. Multiplying both sides of the equation for $F_t$ by -1 and then adding +1 to each side leads to $1 - F_t = 1 - \tfrac{1}{2}(1 + F_{t-1}) = 1 - \tfrac{1}{2} - \tfrac{1}{2}F_{t-1} = \tfrac{1}{2}(1 - F_{t-1})$, or

$$1 - F_t = (\tfrac{1}{2})^t (1 - F_0) \tag{4.18}$$

where $F_0$ is the inbreeding coefficient in the initial generation when the repeated self-fertilization begins. Self-fertilization therefore leads to an extremely rapid increase in the inbreeding coefficient. When $F_0 = 0$, then $F_1 = \tfrac{1}{2}$, $F_2 = \tfrac{3}{4}$, $F_3 = \tfrac{7}{8}$, $F_4 = \tfrac{15}{16}$, and so on. The increase in F under self-fertilization and several other regular systems of mating is shown in Figure 4.15.

Figure 4.14 Increase in F resulting from continued self-fertilization. The organism in generation t is the offspring of self-fertilization of the organism in generation t - 1. The loop shows that $F_t = \tfrac{1}{2}(1 + F_{t-1})$.

Figure 4.15 Theoretical increase in the inbreeding coefficient F for regular systems of mating: selfing, sib mating, half-sib mating, and repeated backcrossing to a single organism from a random-bred strain. The horizontal axis is generations (t). In each case, the initial value of F is assumed to be $F_0 = 0$.

Many plants reproduce predominantly by self-fertilization, including crop plants such as soybeans, sorghum, barley, and wheat. As expected of highly self-fertilizing species, each plant is highly homozygous for alleles such as those determining allozymes. Yet the proportion of polymorphic genes is comparable to that found in outcrossing species.
Polymorphisms are found because self-fertilization does not eliminate genetic variation; it simply reorganizes genetic variation into homozygous genotypes. On the other hand, self-fertilizing species do contain fewer deleterious recessives than do outcrossing species, presumably because the increased homozygosity permits harmful recessives to be eliminated from the population by natural selection.

One other important point about naturally self-fertilizing species: The high homozygosity of all genes implies that recombination rarely results in new gametic types not already present in the parent. Therefore, predominance of selfing has the effect of retarding the approach to linkage equilibrium because the approach to linkage equilibrium is through recombination in double heterozygotes (AB/ab and Ab/aB in the case of two alleles at each locus); with extreme inbreeding, such double heterozygotes are rare. Indeed, the most extreme examples of linkage disequilibrium have been found in predominantly self-fertilizing species such as barley (Hordeum vulgare) and wild oats (Avena barbata).

Barley, which regularly undergoes more than 99% self-fertilization, provides an extreme example of linkage disequilibrium between two unlinked esterase genes (Clegg et al. 1972). A population that had originated as a complex cross was maintained for 26 generations under normal agricultural conditions without conscious selection. The population was polymorphic for two alleles $B_1$ and $B_2$ of an Esterase-B gene and also polymorphic for two alleles $D_1$ and $D_2$ of an Esterase-D gene. The gametic types were found in the following proportions. For all practical purposes, these numbers also refer to homozygous genotypes because there is such close inbreeding.
| $B_1D_1$ | $B_1D_2$ | $B_2D_1$ | $B_2D_2$ |
|---|---|---|---|
| 1501 (1642.6) | 754 (613.7) | 720 (577.1) | 74 (215.6) |

(The numbers in parentheses are the expected numbers based on the assumption of linkage equilibrium, calculated as in Chapter 3.) The $\chi^2$ value in this case is 172.7 with one degree of freedom. The associated probability is much less than 0.0001, and so there is undoubtedly linkage disequilibrium. For the above data, the linkage disequilibrium parameter (Equation 3.9) is D = -0.046, which is about 66% of its theoretical minimum.

One of the dramatic successes of plant breeding has come from the crossing of inbred lines to produce high-yielding hybrid corn. Yield of a genetically heterogeneous, outcrossing variety of corn can be improved by selecting the plants with the highest yields in each generation to be the progenitors of the next generation; such artificial selection results in only gradual improvement, however (see Chapter 9). If a large number of self-fertilized lines are established from a heterogeneous population, each line declines in yield as inbreeding proceeds, owing to the forced homozygosity of deleterious recessives. Many lines become so inferior that they have to be discontinued. Self-fertilized lines are not likely to become homozygous for exactly the same set of deleterious recessives, however, and when different lines are crossed to produce a hybrid, the hybrid becomes heterozygous for these genes. Alleles favoring high yield in corn are generally dominant, and there may also be genes in which the heterozygous genotypes have a more favorable effect on yield than do the homozygous genotypes; in any case, the hybrid has a much higher yield than either inbred parent. The phenomenon of enhanced hybrid performance is called hybrid vigor or heterosis. In practice, inbred lines are crossed in many combinations to identify those that produce the best hybrids.
Yields of hybrid corn are typically 15 to 35% greater than yields of outcrossing varieties, and the successful introduction of hybrid corn has been remarkable. Virtually all corn acreage in the United States today is planted with hybrids, as compared to 4% of the acreage in 1933 (Sprague 1978).

ASSORTATIVE MATING

When choice of mates is based on phenotypes, mating is said to be assortative. Most assortative mating is positive assortative mating; this term means that mating pairs have, on the average, more similar phenotypes than would be expected with random mating. The qualifier "on the average" is important. Even when mating is random, some mating pairs are phenotypically similar, and so positive assortative mating refers only to those situations in which mating pairs are phenotypically more similar than would be expected by chance.

There are also examples of negative assortative mating, sometimes called disassortative mating, in which mating pairs are more dissimilar than expected by chance. One case of negative assortative mating is a polymorphism known as heterostyly found in most species of primroses (Primula) and their relatives. The heterostyly polymorphism refers to the relative lengths of the styles and stamens in the flowers (Figure 4.16). (In botanical terminology, the style is a stalk bearing the stigma, which is the female organ that receives pollen; the stamen is the male organ bearing anthers, in which the pollen is produced.) Most populations of primroses contain approximately equal proportions of two types of flowers, one known as pin, which has a tall style and short stamens, and the other known as thrum, which has a short style and tall stamens.
In heterostyly, insect pollinators that work high on the flowers pick up mostly thrum pollen and deposit it on pin stigmas, whereas pollinators that work low in the flowers pick up mostly pin pollen and deposit it on thrum stigmas. Negative assortative mating therefore takes place because pins mate preferentially with thrums. Additional floral adaptations facilitate the negative assortative mating. For example, pollen grains from pin flowers fit the receptor cells of thrum stigmas better than they do their own, and pollen grains from thrum flowers germinate better on pin stigmas than they do on their own.

Figure 4.16 Diagrams of cross sections of (A) pin and (B) thrum flowers of the primrose Primula. The pin flowers have a long style and short stamens; the thrum flowers have a short style and long stamens. The differences in flower morphology assist in the maintenance of negative assortative mating mediated by insect pollinators.

The pollination biology of flowering plants also provides examples of positive assortative mating. For example, when the length of time in which any plant flowers is short relative to the total duration of the flowering season, then plants that flower early in the season are preferentially pollinated by other early flowering plants, and those that flower late are preferentially pollinated by other late flowering ones. Thus, there is positive assortative mating for flowering time.

In human beings, positive assortative mating is observed for height, IQ score, and certain other traits, although assortative mating varies in degree in different populations and is absent in some. As might be expected, positive assortative mating is found for certain socioeconomic variables. In one study in the United States, the highest correlation found between married couples was in the number of rooms in their parents' homes. Negative assortative mating is apparently quite rare in human populations.
In certain species of Drosophila, a curious type of nonrandom mating is a phenomenon called minority male mating advantage, in which females mate preferentially with males with rare phenotypes. For example, in a study of experimental populations of D. pseudoobscura containing flies homozygous for either a recessive orange eye-color mutation or a recessive purple eye-color mutation, Ehrman (1970) found that, when 20% of the males were orange, the orange-eyed males participated in 30% of the observed matings; conversely, when 20% of the males were purple, the purple-eyed males participated in 40% of the observed matings.

The consequences of positive assortative mating are complex. They depend on the number of genes that influence the trait in question, on the number of different possible alleles of the genes, on the number of different phenotypes, on the sex performing the mate selection, and on the criteria for mate selection. Traits for which mating is assortative are rarely determined by the alleles of a single gene, however. Most such traits are polygenic, so reasonably realistic models of assortative mating tend to be rather complex. Here we should note one obvious, qualitative consequence of positive assortative mating: since like phenotypes tend to mate, assortative mating generally increases the frequency of homozygous genotypes in the population at the expense of heterozygous genotypes, and thus the phenotypic variance in the population increases. (Negative assortative mating generally has the opposite effect.)

SUMMARY

Species that are spread over a large geographical area are usually divided into subpopulations. Matings between organisms within the same subpopulation are more likely than matings between organisms in different subpopulations.
Geographical subdivision of a population is called population substructure. The genetic consequences of population substructure result from the fact that the frequencies of alleles may differ from one subpopulation to the next. When the allele frequencies differ, the average heterozygosity among the subpopulations is smaller than that expected with random mating in the total population. Many populations are subdivided into groups within larger groups, a kind of structure called a hierarchical population structure. The F statistics are a quantitative measure of the reduction in heterozygosity at various levels in a population hierarchy. For example, $F_{SR}$ is the proportionate reduction in average heterozygosity among subpopulations (S) as compared to that expected with HWE within regions (R): $F_{SR} = (H_R - H_S)/H_R$. Similarly, $F_{RT}$ is the proportionate reduction in average heterozygosity among regions (R) as compared to that expected with HWE in the total population (T): $F_{RT} = (H_T - H_R)/H_T$. The fixation index $F_{ST}$ combines the effects due to subdivision into subpopulations within regions and regions within the total population: $F_{ST} = (H_T - H_S)/H_T$. Generally speaking, an F statistic with a value smaller than 0.05 indicates little genetic differentiation, a value from 0.05 to 0.15 indicates moderate genetic differentiation, from 0.15 to 0.25 indicates great genetic differentiation, and above 0.25 indicates very great genetic differentiation among subpopulations.

When subpopulations undergo fusion and random mating, the deficiency of heterozygotes is eliminated. Said another way around, the excess of homozygous genotypes in a subdivided population is eliminated by population fusion and random mating. This effect of population fusion is called the Wahlund principle.
Quantitatively, the Wahlund principle implies that population fusion and random mating will cause a reduction in the frequency of any homozygous genotype by an amount equal to the variance in allele frequency among the original subpopulations. For two alleles, the Wahlund effect is related to the fixation index by the relation $F_{ST} = \sigma^2/(\bar{p}\,\bar{q})$. In terms of the fixation index, the average genotype frequencies across subpopulations are: AA with average frequency $\bar{p}^2(1 - F_{ST}) + \bar{p}F_{ST}$, Aa with average frequency $2\bar{p}\,\bar{q}(1 - F_{ST})$, and aa with average frequency $\bar{q}^2(1 - F_{ST}) + \bar{q}F_{ST}$. Despite the departure from HWE when genotype frequencies are averaged across subpopulations, within each subpopulation mating is random and the genotype frequencies are in HWE for the allele frequencies in the subpopulation.

Inbreeding means mating between relatives. The most important effect of inbreeding is that replicas of a single allele in a common ancestor may be transmitted down both sides of the pedigree and come together in fertilization to produce the inbred organism. In such a case, the inbred organism is said to be autozygous, and the alleles are identical by descent (IBD). Otherwise the inbred organism is allozygous. The inbreeding coefficient F is the probability that the two homologous genes in an inbred organism are IBD. With close inbreeding among parents with relatively recent common ancestors, the value of F can be calculated from elementary probability considerations using the formula $F = \sum (\tfrac{1}{2})^i(1 + F_A)$, where the summation is over all paths from one parent to the other through each common ancestor, i is the number of organisms in the path, and $F_A$ is the inbreeding coefficient of the common ancestor in the path.
Among organisms in which the inbreeding coefficient is F, the genotype frequencies of a gene with two alleles are, for AA, $p^2(1 - F) + pF$; for Aa, $2pq(1 - F)$; and for aa, $q^2(1 - F) + qF$. Hence, one of the most important consequences of close inbreeding is an increased risk of homozygosity of rare recessive alleles: $q^2(1 - F) + qF$ for inbred organisms versus $q^2$ for noninbred organisms. In human populations, a substantial proportion of children affected with rare, homozygous recessive genetic diseases have first-cousin parents, although first-cousin mating is infrequent.

Population substructure results in an accumulation of inbreeding because mating pairs within subpopulations will often have remote relatives in common, even when mates are chosen at random. Thus, the inbreeding coefficient F resulting from nonrandom mating within a subpopulation should be designated $F_{IS}$. The total inbreeding resulting from nonrandom mating combined with all levels of population substructure is given by the expression $(1 - F_{IT}) = (1 - F_{IS}) \times (1 - F_{ST})$.

PROBLEMS

1. Two diploid random mating populations have allele frequencies q + e and q - e for a recessive allele of a gene. What are the frequencies of homozygotes before and after population fusion?

2. Show that $F_{IT} = F_{IS} + F_{ST} - F_{IS}F_{ST}$ and interpret the expression.

3. Calculate $F_{ST}$ among the three random-mating populations below based on the specified allele frequencies. What is the maximum value of $F_{ST}$ in this situation?

| Population | Allele 1 | Allele 2 | Allele 3 |
|---|---|---|---|
| Population 1 | 0.1 | 0.3 | 0.6 |
| Population 2 | 0.2 | 0.3 | 0.5 |
| Population 3 | 0.3 | 0.3 | 0.4 |

4. Calculate $F_{IS}$, $F_{ST}$, and $F_{IT}$ for the populations with the genotype frequencies shown in the following table:

| Genotype | AA | Aa | aa |
|---|---|---|---|
| Population 1 | 0.056 | 0.288 | 0.656 |
| Population 2 | 0.072 | 0.256 | 0.672 |

5. Suppose two subpopulations with equal allele frequencies of two genes have an amount of linkage disequilibrium that is equal but opposite in sign.
What is the amount of linkage disequilibrium in a population formed by mixing equal numbers of individuals from the two populations?

6. Show that $p^2(1 - F) + pF = p^2 + pqF = p - (1 - F)pq$, when q = 1 - p.

7. With two alleles and p = 1/2, what are the expected genotype frequencies in a random mating population and among the offspring of first cousins? How great is the decrease in heterozygosity in the inbred population relative to the random mating population?

8. If the frequency of an autosomal recessive disorder is 1/1600 among unrelated parents, what is the expected frequency among the offspring of first cousins?

9. For a recessive allele at frequency q in a population in which one percent of the matings are between first cousins, but otherwise occur at random, the proportion of affected individuals having first-cousin parents is $(1 + 15q)/(1 + 1599q)$. Calculate for q = 0.1, 0.05, 0.01, 0.005, and 0.001. Interpret the result of the equation when q = 1.

10. In a population of monoecious plants in Hardy-Weinberg proportions for two alleles with allele frequency p, what is the variance in allele frequency among plants? What is the variance if the population were completely inbred? If a random mating population were to undergo self-fertilization, what would the variance be when the inbreeding coefficient equals F?

11. The measure of genetic divergence $G_{ST}$ is very useful for multiple alleles in multiple subpopulations. $G_{ST}$ can be defined as $(J_S - J_T)/(1 - J_T)$, where $p_i$ is the frequency of the ith allele, $J_S = \sum \text{Avg}(p_i^2)$ and $J_T = \sum [\text{Avg}(p_i)]^2$ (Nei 1987). The summation means summation over all alleles, and Avg means the average over all subpopulations. For the random mating populations below, calculate $F_{ST}$ and $G_{ST}$.

| Population | Allele 1 | Allele 2 | Allele 3 |
|---|---|---|---|
| Population 1 | 0.2 | 0.3 | 0.5 |
| Population 2 | 0.6 | 0.0 | 0.4 |

12. $G_{ST}$ for multiple alleles is actually a weighted average of $F_{ST}$ values: $G_{ST} = \sum p_i(1 - p_i)F_{ST(i)} / \sum p_i(1 - p_i)$, where the summation is over all alleles,
$\bar{p}_i$ is the average frequency of the $i$th allele among the subpopulations, and $F_{ST(i)}$ is the $F_{ST}$ value for the $i$th allele calculated as if the gene had only two alleles with frequencies $p_i$ and $1 - p_i$ in each subpopulation. Calculate $F_{ST(i)}$ for each allele in the preceding problem and confirm numerically that the weighted average equals $G_{ST}$.

13. In calculating $F$ from pedigrees for X-linked genes, why are paths with two or more consecutive males not counted?

14. What is the coefficient of relationship between $I$ and $J$ in the accompanying pedigree, where $I$ and $J$ are the offspring of a pair of first cousins ($A$, $B$) mated with another pair of first cousins ($C$, $D$)?

[Pedigree figure]

15. Assuming $F_A = F_B = 0$, calculate the inbreeding coefficient for each of the individuals $C$ through $I$ in the accompanying pedigree.

[Pedigree figure]

16. If a population is maintained by self-fertilization in even-numbered generations and by random mating in odd-numbered generations, what happens to the inbreeding coefficient?

17. For a gene with two alleles and $p = 0.3$, what are the expected genotype frequencies after five generations of sib mating? What are the expected genotype frequencies after one additional generation of random mating?

18. What is the inbreeding coefficient in a population of size 50 that undergoes
a. 47 generations of random mating followed by three generations of sib mating?
b. 50 generations of random mating?

19. In gametophytic self-incompatible plants, the pollen can only fertilize ovules whose genotype has neither allele borne by the haploid pollen. In a plant population at equilibrium with three gametophytic self-incompatibility alleles, what is the probability that a pollen grain will land on a …

20. Two-way hybrid corn is produced by crossing two different inbred lines; three-way hybrids are produced by crossing a two-way hybrid with an unrelated inbred; and four-way hybrids are produced by crossing two different two-way hybrids.
What is the inbreeding coefficient of the offspring of randomly mated two-way, three-way, or four-way hybrids? (Hint: Consider the allele frequencies in gametes.)

21. Derive a recursion equation for $F_t$ for repeated parent-offspring mating (see pedigree), and calculate $F_t$ for $t$ up to 5.

22. Derive a recursion equation for $F_t$ for repeated backcrossing to a single noninbred individual $A$ (see pedigree). Calculate $F_t$ for $t$ up to 5 and the equilibrium value.

CHAPTER 5

Sources of Variation

Mutation, Infinite Alleles Model, Neutral Mutations, Recombination, Migration, Transposable Elements

Genetics includes several processes that create new types of genetic variation in populations or that allow for the reorganization of previously existing variation either within genomes or among subpopulations. The ultimate source of genetic variation is mutation, by which we mean any heritable change in the genetic material. Mutation therefore includes a change in the nucleotide sequence of a single gene as well as the formation of a chromosome rearrangement, such as an inversion or a translocation. Recombination brings mutations of different genes together into the same chromosome. Migration enables mutations to spread among subpopulations. A transposable element is a DNA sequence able to replicate and insert into any of a large number of sites in the genome. By insertion in or near a gene, a transposable element can alter the level or pattern of gene expression; recombination between transposable elements can result in a chromosome rearrangement, for example, an inversion. In this chapter, we consider the processes by which genetic variation is created.

MUTATION

Mutation is the ultimate source of genetic variation for evolutionary change. However, most wild type genes mutate at a very low rate, typically in the range from $10^{-4}$ to $10^{-6}$ new mutations per gene per generation.
Even a low mutation rate can create many new mutant alleles because, in a large population, each of a large number of genes is at risk of mutating. In a population of size $N$ diploid organisms, there are $2N$ copies of each gene, each of which can mutate in any generation. Mutations are rare, but in a large population there are many alleles at risk. For example, if the mutation rate (probability of mutation) is $10^{-9}$ per nucleotide pair per generation, then in each human gamete, the DNA of which contains $3 \times 10^9$ nucleotide pairs, there would be an average of three new mutations in each generation; each newly fertilized egg would carry, on the average, six new mutations. The present-day human population of approximately 6 billion people would therefore be expected to carry approximately 36 billion new mutations that were not present even one generation earlier.

Irreversible Mutation

Although mutation may create a new allele, the initial frequency of the mutant allele must be very small if the population size is large. A single new mutant allele in a diploid population of size $N$ has an initial frequency of $1/2N$. New mutations in subsequent generations may augment the number of mutant alleles, but recurrent mutation alone increases the allele frequency of the mutant very slowly. Consider an example in which $A$ is the wildtype allele and $a$ the mutant form. If there is exactly one new mutation per generation, then the allele frequency of $a$ increases according to the series $1/2N, 2/2N, 3/2N, \ldots$ and, if $N$ is large (for example, $N = 10^6$), then the increase is very slow indeed. Hence, the tendency for allele frequency to change as a result of recurrent mutation (mutation pressure) is very small. On the other hand, the cumulative effects of mutation over long periods of time can become appreciable.

A useful model for thinking about mutation is the Hardy-Weinberg model of Chapter 3, but with mutation permitted.
For the moment, we focus on mutations that have so little effect on the ability of the organism to survive and reproduce that natural selection does not appreciably influence their frequency. We will also assume that mutation is irreversible, which means that $a$ cannot reverse-mutate to $A$. To avoid complications resulting from change in allele frequency due to chance, we will assume a population that is infinite in size.

Consider a gene with two alleles, $A$ and $a$, and suppose that $A$ mutates to $a$ at a rate of $\mu$ mutations per $A$ allele per generation. In other words, each $A$ allele has a probability of $\mu$ of mutating to $a$ in any generation. We will symbolize the allele frequency of $A$ as $p$ and that of $a$ as $q$ and keep track of generations with subscripts. Hence, $p_t$ and $q_t$ are the allele frequencies of $A$ and $a$, respectively, in the $t$th generation, where $t = 0, 1, 2, \ldots$ In any generation, $p_t + q_t = 1$ because $A$ and $a$ are the only alleles considered.

Next we will deduce a formula for the allele frequency $p_t$ in terms of the allele frequency $p_{t-1}$ in the previous generation. In generation $t$, $p_t$ includes all the $A$ alleles in generation $t - 1$ that did not mutate in that generation, and so

$$p_t = p_{t-1}(1 - \mu)$$

However, by the same reasoning, $p_{t-1}$ includes all $A$ alleles in generation $t - 2$ that did not mutate in that generation, and so $p_{t-1} = p_{t-2}(1 - \mu)$. Substituting this equation into the one above yields

$$p_t = p_{t-2}(1 - \mu)^2$$

Continuing in the same manner leads eventually to

$$p_t = p_0(1 - \mu)^t \tag{5.1}$$

The effect of mutation pressure on allele frequency is illustrated in Figure 5.1 for the case $\mu = 10^{-4}$. The allele frequency of $A$ decreases very slowly, almost linearly at first because the governing term in Equation 5.1, $(1 - \mu)^t$, is approximated by $1 - \mu t$ when $t$ is sufficiently small. After 1000 generations, the allele frequency of $A$ is still 0.90; however, at $t$ = 10,000 generations, $p_t = 0.37$; and at $t$ = 20,000 generations, $p_t = 0.14$.
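
The values quoted for Equation 5.1 can be checked numerically. The following is a minimal sketch in Python, assuming $p_0 = 1$ and $\mu = 10^{-4}$ as in the Figure 5.1 example; the function name `allele_freq_irreversible` is hypothetical and is introduced only for illustration. It also prints the linear approximation $p_0(1 - \mu t)$ for comparison.

```python
def allele_freq_irreversible(p0, mu, t):
    """Frequency of allele A after t generations of irreversible mutation
    A -> a at rate mu per generation (Equation 5.1): p_t = p0 * (1 - mu)**t."""
    return p0 * (1.0 - mu) ** t


if __name__ == "__main__":
    p0, mu = 1.0, 1e-4  # assumed values, matching the Figure 5.1 example
    for t in (100, 1_000, 10_000, 20_000):
        exact = allele_freq_irreversible(p0, mu, t)
        linear = p0 * (1.0 - mu * t)  # approximation valid only when mu * t is small
        print(f"t = {t:>6}: exact = {exact:.3f}, linear approx = {linear:.3f}")
```

The two columns agree closely while $\mu t$ is small and diverge badly once $\mu t$ approaches 1, which is why the approximation is described as holding only for sufficiently small $t$.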
One instructive way to analyze Equation 5.1 is to consider the time required to reduce the allele frequency of $A$ by half. To find the "half-life" of the process, set $p_t = 0.5 \times p_0$; this relationship implies that $0.5 = (1 - \mu)^t$. Taking logarithms of both sides, we obtain

$$t_{1/2} = \ln(0.5)/\ln(1 - \mu) \approx 0.6931/\mu$$

In the example in Figure 5.1, $t_{1/2}$ = 6931 generations. A decrease in $\mu$ by a factor of 10 increases $t_{1/2}$ accordingly to approximately 69,310 generations for $\mu = 10^{-5}$ and to approximately 693,100 generations for $\mu = 10^{-6}$. The fact that mutation pressure is a weak force for changing allele frequency is illustrated by the long half-lives calculated for realistic values of the mutation rate.

[Figure 5.1: plot of allele frequency against time ($t$, in generations), 0 to 40,000.]

Figure 5.1 Change in frequency under mutation pressure. In this example, an allele $A$ mutates to $a$ at a rate of $\mu = 1 \times 10^{-4}$ per generation; $p_t$ is the allele frequency of $A$ in generation $t$. We assume that $p_0 = 1$. With the given value of $\mu$, the allele frequency decreases by half every 6931 generations.

As noted with reference to Equation 5.1, the approximation $p_t = p_0(1 - \mu t)$ is quite accurate for small values of $t$. With respect to the allele frequency of the mutant allele $a$, the approximation can also be written as $q_t = q_0 + \mu t$, provided that $q_0$ is small. This approximation implies that the allele frequency of the $a$ allele increases linearly with time with a slope equal to $\mu$. Because $\mu$ is small, however, the linear increase in $q_t$ is difficult to detect experimentally except in very large populations. A large population size can be attained in a bacterial chemostat, which is a device for maintaining a population of bacteria in a continuous state of growth and cell division (Figure 5.2). The linear increase in $q_t$ from mutation pressure can be observed in a chemostat.

[Figure 5.2: diagram labeled with nutrient medium input, air bubbles, bacterial growth chamber, and air input.]

Figure 5.2 Diagram of a bacterial chemostat.
Nutrient medium drips in at the top, but a constant volume is maintained by means of an overflow siphon. The air coming in at the bottom provides oxygen. At the steady state, the rate of inflow of nutrient equals the rate of outflow. Cells within the chemostat are in a continuous state of division, but the population does not increase in size because, in any interval of time, the number of new cells produced by division is balanced by the number washed out through the siphon.

The increase in $q_t$ observed in one chemostat experiment is shown in Figure 5.3. Note the abrupt increase in mutation rate (indicated by the increase in slope) shortly after the addition of caffeine, a bacterial mutagen.

[Figure 5.3: plot of $q_t$ (axis marked $2 \times 10^{-6}$, $4 \times 10^{-6}$, $6 \times 10^{-6}$) against time ($t$, in generations); the point at which caffeine was added is marked.]

Figure 5.3 Estimation of mutation rate in a bacterial chemostat. This example concerns the rate of mutation of a gene in Escherichia coli that confers resistance to infection by the bacteriophage T5. The frequency $q_t$ is the frequency of T5-resistant cells after $t$ generations of growth. The mutation rate is estimated as the slope of the straight-line segments. Prior to the addition of caffeine, the slope was $\mu = 7.2 \times 10^{-8}$ per generation. After addition of caffeine at a concentration of 150 mg/l, the slope increased about tenfold to $\mu = 66 \times 10^{-8}$ per generation. In this experiment, the generation time was 5.5 hours. (From Novick 1955.)

PROBLEM 5.1 A genetic factor has been described in Drosophila mauritiana that results in the spontaneous deletion of the transposable genetic element mariner at a frequency of approximately one percent per generation for each copy (Bryan et al. 1987). In a population containing an autosomal site at which a mariner insertion is fixed (homozygous), how many generations would be required for the frequency of flies that are homozygous for a deletion of the element to exceed five percent?
Assume that the population is large, that mating is random, that the excision factor is fixed, and that deletion of the element does not affect survival or reproduction.

ANSWER Let $p_t$ be the frequency of chromosomes in which the mariner element remains undeleted in generation $t$, and let $\mu = 0.01$ be the probability of deletion of the element per generation. For this situation, Equation 5.1 applies with $\mu = 0.01$ and $p_0 = 1$. The frequency of deletion homozygotes is greater than five percent when $(1 - p_t)^2 > 0.05$, or $p_t < 1 - (0.05)^{1/2} = 0.776$. Thus, $t$ should be greater than $\ln(0.776)/\ln(0.99) = 25.2$ generations.

Reversible Mutation

In this section, in addition to forward mutation of $A$ to $a$, we also allow reverse mutation from $a$ to $A$. In this case, the mutation pressure on the allele frequency $p$ is in both directions: forward mutation tends to decrease $p$, reverse mutation tends to increase $p$. Eventually, an equilibrium is reached in which the frequency $p$ remains constant from generation to generation. At this point, the loss of $A$ alleles from forward mutation is exactly offset by the gain of $A$ alleles from reverse mutation.

To deduce the point of equilibrium, suppose that the rate of forward mutation from $A$ to $a$ is $\mu$ per generation and that the rate of reverse mutation from $a$ to $A$ is $\nu$ per generation. Let $p_t$ and $q_t$ denote the allele frequencies of $A$ and $a$ in generation $t$, so that $p_t + q_t = 1$. An $A$ allele in generation $t$ can originate in either of two ways. It could have been an $A$ allele in generation $t - 1$ that escaped mutation to $a$ (which happens with probability $1 - \mu$), or it could have been an $a$ allele in generation $t - 1$ that mutated to $A$ (which happens with probability $\nu$). In symbols,

$$p_t = p_{t-1}(1 - \mu) + (1 - p_{t-1})\nu \tag{5.2}$$

To solve equations of this type, a useful trick is to determine whether the relation can be expressed in the form $p_t - A = (p_{t-1} - A)B$, where $A$ and $B$ are constants dependent only on $\mu$ and $\nu$.
Simplifying, we obtain $p_t = p_{t-1}B + A(1 - B)$. Putting Equation 5.2 into the same form yields $p_t = p_{t-1}(1 - \mu - \nu) + \nu$. Equating like terms, we deduce that $B = 1 - \mu - \nu$ and $A(1 - B) = \nu$. Consequently, $A = \nu/(\mu + \nu)$. Hence, we can rewrite Equation 5.2 in the form

$$p_t - \frac{\nu}{\mu + \nu} = \left(p_{t-1} - \frac{\nu}{\mu + \nu}\right)(1 - \mu - \nu) \tag{5.3}$$

Because the relation between $p_{t-1}$ and $p_{t-2}$ is the same as that between $p_t$ and $p_{t-1}$, the solution to Equation 5.3 is

$$p_t - \frac{\nu}{\mu + \nu} = \left(p_0 - \frac{\nu}{\mu + \nu}\right)(1 - \mu - \nu)^t \tag{5.4}$$

To understand what happens to the allele frequency in the long run, consider Equation 5.4 in the case when $t$ is very large, for example $10^5$ or $10^6$ generations. Even though $1 - \mu - \nu$ is ordinarily close to 1, the value of $t$ eventually becomes so large that $(1 - \mu - \nu)^t$ becomes approximately 0. Thus, the whole right-hand term in Equation 5.4 goes to 0, and so $p_t$ eventually attains a value that remains the same generation after generation. Such a value of $p$ is called an equilibrium value, which we will denote by $\hat{p}$. In case of reversible mutation, the equilibrium is found by equating the left-hand side of Equation 5.4 to 0; hence

$$\hat{p} = \frac{\nu}{\mu + \nu} \tag{5.5}$$

The manner in which $p_t$ converges to its equilibrium value is shown in Figure 5.4 for the case $\mu = 10^{-4}$ and $\nu = 10^{-5}$. Note that, whatever the initial frequency of $A$, the allele frequency eventually goes to $\hat{p}$, which in this example equals $0.00001/(0.0001 + 0.00001) = 0.091$. Figure 5.4 also indicates that mutation pressure is usually very weak in changing allele frequency inasmuch as the population requires thousands or tens of thousands of generations to reach equilibrium.

[Figure 5.4: plot of allele frequency against time ($t$, in generations), 0 to 50,000.]

Figure 5.4 Theoretical change in allele frequency under pressure of reversible mutation. The attainment of near-equilibrium values requires tens of thousands of generations for realistic mutation rates.
In this example, the forward mutation rate ($A \to a$) is $\mu = 10^{-4}$ and the reverse mutation rate ($a \to A$) is $\nu = 10^{-5}$. The equilibrium allele frequency of $A$, calculated from Equation 5.5, is 0.091.

PROBLEM 5.2 The bacterium Salmonella typhimurium has a genetic switching mechanism that regulates the production of alternative forms of a protein component of the cellular flagella. There are two alleles, which we will call $A$ (for the "specific-phase" flagellar property) and $a$ (for the "group-phase" flagellar property). Switching back and forth between $A$ and $a$ takes place rapidly enough that Equation 5.4 can be applied. The transition from $A$ to $a$ has a rate of $\mu = 8.6 \times 10^{-4}$ per generation, and that of $a$ to $A$ has a rate of $\nu = 4.7 \times 10^{-3}$ per generation. These rates are orders of magnitude larger than mutation rates typically observed for other genes. The reason is that the change from $A$ to $a$ and back again does not result from mutation in the conventional sense but from intrachromosomal recombination (Simon et al. 1980). Formally, however, we can treat the system as one with reversible mutation. In cultures initially established with the frequency of $A$ at $p_0 = 0$, Stocker (1949) found that the frequency increased to $p = 0.16$ after 30 generations and to $p = 0.85$ after 700 generations. In cultures initiated with $p_0 = 1$, the frequency decreased to 0.88 after 388 generations and to 0.86 after 700 generations. How do these values agree with those calculated from Equation 5.4 using the estimated mutation rates? What is the predicted equilibrium frequency of $A$?

ANSWER Note that $\nu/(\mu + \nu) = 0.845$. This is the predicted equilibrium frequency (Equation 5.5). Also, $1 - \mu - \nu = 0.99444$, and this quantity determines the rate of approach to equilibrium. For the cultures with $p_0 = 0$, the predicted values are $p_{30} = 0.845 - (0.845)(0.99444)^{30} = 0.13$ and $p_{700} = 0.845 - (0.845)(0.99444)^{700} = 0.83$.
For the cultures with $p_0 = 1$, the predicted values are $p_{388} = 0.845 + (0.155)(0.99444)^{388} = 0.86$ and $p_{700} = 0.845 + (0.155)(0.99444)^{700} = 0.85$. The predicted values are in very good agreement with the observations.

Probability of Fixation of a New Neutral Mutation

The assumption of an infinite population size is not very realistic. In an improved model in which the population is finite, the change in frequency of a mutant allele depends not only on the mutation pressure but also on random sampling from generation to generation. The sampling process, called random genetic drift, results in chance changes in allele frequency. The process is illustrated in Figure 5.5. The squares represent the $2N$ alleles in the adult population in generation $t$. Each allele is assigned a unique label ($\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_{2N}$) to temporarily mask its identity as either $A$ or $a$. The circles represent the essentially infinite pool of gametes in generation $t$. In the gamete pool, each labeled allele has a frequency of $1/2N$. The squares at the bottom represent two diploid genotypes in generation $t + 1$ formed by random sampling from the pool of gametes. By chance, the two alleles forming a genotype may be replicas of the same allele in the previous generation or may come from different alleles.

[Figure 5.5: alleles in the breeding population in generation $t - 1$; gametes (each type with frequency $1/2N$); genotypes in generation $t$, $\alpha_i\alpha_i$ (probability $1/2N$) and $\alpha_i\alpha_j$ (probability $1 - 1/2N$).]

Figure 5.5 Random sampling of alleles in a finite population increases the probability of identity by descent (IBD).
Two randomly chosen alleles, illustrated in the squares at the bottom, may be IBD either because they are replicas of the same allele in the immediately preceding generation ($\alpha_i\alpha_i$) or because they are replicas of the same allele in a more remote generation ($\alpha_i\alpha_j$).

For example, the genotype $\alpha_i\alpha_i$ consists of two replicas of the same allele in the previous generation, whereas the two alleles in the genotype $\alpha_i\alpha_j$ come from different alleles in the previous generation.

The random sampling from the gamete pool means that some alleles may be overrepresented in generation $t + 1$, relative to their frequency in generation $t$, and some alleles may be underrepresented. Indeed, any particular allele has a good chance of being unrepresented in generation $t + 1$, and hence the lineage of that allele is terminated. To be precise, each allele in generation $t$ has a chance of approximately $1/e = 0.368$ of not being represented in generation $t + 1$. To understand why, consider the allele designated $\alpha_i$. The frequency of $\alpha_i$ in the gamete pool is $1/2N$, and the frequency of all other alleles together is therefore $1 - 1/2N$. Because the genotypes in generation $t + 1$ are formed by the random selection of $2N$ alleles from the pool of gametes, the distribution of the number of $\alpha_i$ and non-$\alpha_i$ alleles present in generation $t + 1$ is given by successive terms in the binomial expansion (Chapter 1):

$$\left[\frac{1}{2N}\,\alpha_i + \left(1 - \frac{1}{2N}\right)\alpha_x\right]^{2N} \tag{5.6}$$

in which $\alpha_x$ represents the collection of all alleles other than $\alpha_i$. Hence, the probability that $\alpha_i$ is not represented in generation $t + 1$ is

$$\left(1 - \frac{1}{2N}\right)^{2N} \approx 1/e = 0.368 \tag{5.7}$$

The approximation is very good even when $N$ is quite small. For example, when $N = 10$, the left-hand side of Equation 5.7 equals 0.358, and, when $N = 20$, the left-hand side equals 0.363.

The important implication of Equation 5.7 is that, owing to random genetic drift, the ancestral lineage of each allele faces a substantial risk of extinction in each generation.
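
The quality of the approximation in Equation 5.7 can be checked directly. The following is a minimal Python sketch; the function name `prob_allele_lost` is hypothetical, and the population sizes beyond $N = 10$ and $N = 20$ are chosen only for illustration.

```python
import math


def prob_allele_lost(n_diploid):
    """Probability that one particular allele copy leaves no copies in the
    next generation when 2N gametes are drawn at random (Equation 5.7)."""
    two_n = 2 * n_diploid
    return (1.0 - 1.0 / two_n) ** two_n


if __name__ == "__main__":
    for n in (10, 20, 100, 10_000):
        print(f"N = {n:>6}: (1 - 1/2N)^(2N) = {prob_allele_lost(n):.4f}")
    print(f"limit 1/e = {1.0 / math.e:.4f}")
```

The computed probability rises toward $1/e$ as $N$ increases, so even in very small populations roughly a third of the allele copies fail to be transmitted in any one generation.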
As time goes on, the lineages progressively disappear, one or a few at a time. Eventually, a time is reached at which all lineages except one have become extinct. At that time, every allele in the population is identical by descent with a particular allele present in an ancestral population.

The ultimate extinction of all but one lineage implies the answer to the question: What is the probability that a single new mutation eventually becomes fixed in a population of $2N$ alleles? The reasoning is illustrated in Figure 5.6. Parts A and B show all the alleles present in the current generation, immediately after a new mutation (shaded circle) has been created. After a sufficient number of generations have passed, each of the alleles in the descendant population will descend from a single allele, chosen at random, in the current population. In part A, the descendant alleles all derive from one of the nonmutants in the current population; the nonmutant alleles have frequency $1 - 1/2N$, and so this is the probability of ultimate fixation of a nonmutant.

[Figure 5.6: panels (A) and (B), each showing the alleles present in the current generation and the alleles present many generations later; panel A has probability $1 - 1/2N$ and panel B has probability $1/2N$.]

Figure 5.6 In a finite population, the lineages of all alleles must trace back to a single allele in some ancestral population. Here, a particular allele of interest in a diploid population of size $N$ is indicated by the shaded circle. (A) The probability the designated allele is not destined to be the common ancestor of all alleles many generations in the future is $1 - 1/2N$. (B) The probability the designated allele is destined to be the common ancestor of all alleles many generations in the future is $1/2N$. Hence, the probability of ultimate fixation of a newly arising neutral allele is $1/2N$.
In part B, the descendant alleles all derive from the mutant, and so $1/2N$ is the probability of ultimate fixation of a new mutant allele. More generally, the probability of ultimate fixation of a selectively neutral allele (one that does not affect the survival or reproduction of the organism) in a finite population is equal to the frequency of that allele in the initial population. For the lucky few neutral alleles that are eventually fixed, the process takes a long time: on the average, $4N$ generations. The method by which this result can be deduced is considered in Chapter 7.

The Infinite-Alleles Model

Recall from Chapter 2 that many genes have more than two alleles represented among the organisms in a natural population. It is therefore of some importance to determine the expected level of genetic variation under mutation pressure. A convenient measure of genetic variation is the heterozygosity (the proportion of heterozygous genotypes). If a gene has a greater heterozygosity than expected from mutation pressure alone, then other forces that operate in nature must tend to preserve genetic variation. On the other hand, if a gene has a smaller heterozygosity than expected, then other forces must tend to eliminate genetic variation.

The heterozygosity of a gene is a function of the number of alleles and their relative frequencies. In principle, the number of alleles of any gene could be very large. For example, a gene coding for a protein of 300 amino acids has a coding sequence 900 nucleotides in length. Because each nucleotide site could be occupied by either an A, T, G, or C, the total number of possible alleles is $4^{900}$, which equals about $10^{542}$. Hence, we can suppose that every new mutation creates an allele that does not already exist in the population. This is called the infinite-alleles model of mutation. The infinite-alleles model is but one way to specify the characteristics of new mutations.
Although it represents a somewhat simplified view of mutation, it nevertheless provides a useful standard of comparison for other models or for observed allele frequencies.

In the infinite-alleles model, two alleles that are identical by state must also be identical by descent because of the assumption that each mutation creates a unique allele. Hence, in this model, homozygous genotypes must be autozygous. To measure the homozygosity, therefore, we need to calculate the autozygosity. This can be done with reference to the finite-population model in Figure 5.5. As in Chapter 4, we let $F_t$ be the probability that, in generation $t$, two alleles randomly chosen from a population are identical by descent. In the context of Figure 5.5, the randomly chosen alleles are combined in pairs to make genotypes, and so $F_t$ is also the probability of autozygosity in generation $t$.

We will use the $\alpha_i\alpha_i$ and $\alpha_i\alpha_j$ genotypes in generation $t$ in Figure 5.5 to derive an expression for $F_t$ in terms of $F_{t-1}$, $N$, and the mutation rate $\mu$. First, consider the genotype $\alpha_i\alpha_i$. What is the probability that this genotype has alleles that are identical by descent? The alleles must be identical by descent provided that neither allele has mutated in the course of one generation, and so the probability of identity by descent in this case is $(1 - \mu)^2$. Now consider the genotype $\alpha_i\alpha_j$. These alleles are identical by descent only if two randomly chosen alleles in generation $t - 1$ are identical by descent, and if neither allele mutated in the course of one generation, and so the probability of identity by descent in this case is $F_{t-1}(1 - \mu)^2$. Because each of the labeled $\alpha$'s in Figure 5.5 has the same frequency in the gamete pool (namely, $1/2N$), the probability of a combination like $\alpha_i\alpha_i$ is $1/2N$ and the probability of a combination like $\alpha_i\alpha_j$ is $1 - 1/2N$.
Putting all this together, the recurrence equation for $F_t$ is

$$F_t = \frac{1}{2N}(1 - \mu)^2 + \left(1 - \frac{1}{2N}\right)(1 - \mu)^2 F_{t-1} \tag{5.8}$$

Eventually an equilibrium value of $F$, call it $\hat{F}$, is attained in which the increase in autozygosity from random genetic drift in any generation is exactly offset by the decrease in autozygosity from new mutations. The equilibrium can be found by equating $F_t = F_{t-1} = \hat{F}$ in Equation 5.8 and solving. Ignoring terms in $\mu^2$ and those in $\mu/N$ because they are expected to be negligibly small, the solution is

$$\hat{F} = \frac{1}{1 + 4N\mu} \tag{5.9}$$

to an excellent approximation. Therefore, the number of selectively neutral alleles increases under mutation pressure until $\hat{F}$ satisfies Equation 5.9. Being the equilibrium value of the probability of identity by descent, $\hat{F}$ is also the equilibrium value of the autozygosity. Because of the assumption in the infinite-alleles model that each allele in the population arises only once, all genotypes that are homozygotes must also be autozygous. Therefore, $\hat{F}$ can also be interpreted as the equilibrium value of the proportion of homozygous genotypes.

It is an odd feature of Equation 5.9 that it gives the equilibrium homozygosity of a population without explicit reference to allele frequencies. The natural way to write the homozygosity expected with random mating for $n$ alleles with frequencies $p_1, p_2, p_3, \ldots, p_n$ is

$$\sum p_i^2 = p_1^2 + p_2^2 + \cdots + p_n^2 \tag{5.10}$$

We thus have two expressions for the equilibrium homozygosity in the forms of Equations 5.9 and 5.10. Because the two equations refer to the same thing, they must equal each other, and so $\sum p_i^2 = \hat{F} = 1/(4N\mu + 1)$. Alternative approaches leading to essentially the same result are discussed in Sved and Latter (1977).

The homozygosity is the proportion of homozygous genotypes in a population; the heterozygosity is the proportion of heterozygous genotypes. Hence, homozygosity and heterozygosity are opposite sides of the same coin.
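
The claim that Equation 5.9 is an excellent approximation to the equilibrium of Equation 5.8 can be examined by simply iterating the recurrence. The sketch below is one way to do so in Python; the function names `next_f` and `equilibrium_f`, the number of generations, and the values $N = 1000$ and $\mu = 10^{-4}$ are assumptions chosen for illustration and are not taken from the text.

```python
def next_f(f_prev, n_diploid, mu):
    """One generation of Equation 5.8 for the infinite-alleles model."""
    no_mutation = (1.0 - mu) ** 2        # neither of the two alleles mutated
    same_parent = 1.0 / (2 * n_diploid)  # both alleles copy one parental allele
    return same_parent * no_mutation + (1.0 - same_parent) * no_mutation * f_prev


def equilibrium_f(n_diploid, mu, generations=100_000, f0=0.0):
    """Iterate Equation 5.8 from f0 for a fixed number of generations."""
    f = f0
    for _ in range(generations):
        f = next_f(f, n_diploid, mu)
    return f


if __name__ == "__main__":
    n, mu = 1_000, 1e-4  # assumed illustrative values, so that 4*N*mu = 0.4
    iterated = equilibrium_f(n, mu)
    approx = 1.0 / (1.0 + 4 * n * mu)  # Equation 5.9
    print(f"iterated F = {iterated:.5f}, Equation 5.9 = {approx:.5f}")
```

Starting the iteration from $F_0 = 0$ or from $F_0 = 1$ leads to the same limiting value, which illustrates that the equilibrium does not depend on the initial autozygosity.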
Therefore, if the homozygosity in a population is given by $\hat{F} = 1/(4N\mu + 1)$, then the heterozygosity is given by $1 - \hat{F} = 4N\mu/(4N\mu + 1)$. These functions for the equilibrium homozygosity and heterozygosity are plotted against $4N\mu$ in Figure 5.7. The illustration shows that there is a rather narrow range of $4N\mu$ over which an intermediate level of genetic variation (heterozygosity) is maintained. For example, the equilibrium heterozygosity is in the range 0.2 to 0.8 only when $4N\mu$ is in the range 0.25 to 4.

[Figure 5.7: plot of homozygosity and heterozygosity against the value of $4N\mu$.]

Figure 5.7 Plot of average homozygosity and average heterozygosity for the infinite-alleles model. Intermediate values of heterozygosity are maintained over only a small range of $4N\mu$.

A complication in the interpretation of Equation 5.10 is that any number of distributions of allele frequency can result in the same homozygosity. For example, a population in HWE with the four alleles at frequencies $p_1 = 0.7$, $p_2 = 0.1$, $p_3 = 0.1$, and $p_4 = 0.1$ has a homozygosity of $\sum p_i^2 = 0.52$; likewise, a population in HWE with two alleles at frequencies $p_1 = 0.6$ and $p_2 = 0.4$ also has a homozygosity of 0.52. The problem that many distributions of allele frequency can result in the same homozygosity can be sidestepped by assuming that all alleles are equally frequent. If the population contains $n$ equally frequent alleles, then $p_1 = p_2 = p_3 = \cdots = p_n = 1/n$; the homozygosity is calculated from Equation 5.10 as $\sum p_i^2 = n(1/n)^2 = 1/n$. At equilibrium, therefore, $1/n = \hat{F} = 1/(4N\mu + 1)$, or $n = 4N\mu + 1$. The number $n$ of equally frequent alleles is called the effective number of alleles, often symbolized as $n_e$. Diverse distributions of allele frequency can be compared in terms of their effective number of alleles. Biologically speaking, $n_e$ is the number of equally frequent alleles that would be required to produce the same homozygosity as observed in an actual population.
In the examples given at the beginning of this paragraph, the four-allele population and the two-allele population with identical homozygosities of 0.52 also have the same effective number of alleles, namely $n_e = 1/0.52 = 1.92$.

PROBLEM 5.3 An allozyme study of a Caribbean population of Drosophila willistoni (Ayala and Tracy 1974) yielded the following estimated allele frequencies for the loci Adk-1 (adenylate kinase-1), Lap-5 (leucine amino peptidase-5), and Xdh (xanthine dehydrogenase).

| Allele | Adk-1 | Lap-5 | Xdh |
|---|---|---|---|
| Allele 1 | 0.574 | 0.801 | 0.446 |
| Allele 2 | 0.309 | 0.177 | 0.406 |
| Allele 3 | 0.114 | 0.014 | 0.092 |
| Allele 4 | 0.003 | 0.004 | 0.034 |
| Allele 5 | - | 0.004 | 0.014 |
| Allele 6 | - | - | 0.004 |
| Allele 7 | - | - | 0.002 |
| Allele 8 | - | - | 0.002 |

Estimate the effective number of alleles of each gene.

ANSWER The effective number of alleles is estimated as the reciprocal of $\sum p_i^2$. For Adk-1, $n_e = 2.28$; for Lap-5, $n_e = 1.49$; and for Xdh, $n_e = 2.68$. Note that the effective number of alleles is determined more by the uniformity of allele frequencies than by the actual number of alleles. For example, Lap-5 has more actual alleles than Adk-1 but a smaller effective number of alleles.

Neutral Mutations

The hypothesis that many genetic polymorphisms result from selectively neutral alleles maintained by a balance between the effects of mutation and random genetic drift is known as the neutral theory or the theory of selective neutrality (Kimura 1968; King and Jukes 1969). Mutation introduces new alleles into a population, and random genetic drift determines whether a neutral allele will ultimately be fixed or lost. (Loss is the usual outcome.) At equilibrium, there is a balance between mutation and random genetic drift, so that, on the average, each new allele gained by mutation is balanced against an existing allele that is lost (or, more rarely, fixed). The balance point for the homozygosity in the infinite-alleles model is given in Equation 5.9.
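
Returning briefly to the effective number of alleles, the estimates in Problem 5.3 can be reproduced with a few lines of code. The following Python sketch uses the allele frequencies tabulated in that problem; the function name `effective_number_of_alleles` and the dictionary `LOCI` are hypothetical names introduced only for this illustration.

```python
def effective_number_of_alleles(freqs):
    """Reciprocal of the expected homozygosity, 1 / sum(p_i**2)."""
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies should sum to 1")
    return 1.0 / sum(p * p for p in freqs)


# Allele frequencies as given in Problem 5.3.
LOCI = {
    "Adk-1": [0.574, 0.309, 0.114, 0.003],
    "Lap-5": [0.801, 0.177, 0.014, 0.004, 0.004],
    "Xdh": [0.446, 0.406, 0.092, 0.034, 0.014, 0.004, 0.002, 0.002],
}

if __name__ == "__main__":
    for locus, freqs in LOCI.items():
        print(f"{locus}: n_e = {effective_number_of_alleles(freqs):.2f}")
    # Two distributions with the same homozygosity (0.52) give the same n_e.
    print(f"{effective_number_of_alleles([0.7, 0.1, 0.1, 0.1]):.2f}")
    print(f"{effective_number_of_alleles([0.6, 0.4]):.2f}")
```

Because rare alleles contribute almost nothing to $\sum p_i^2$, adding them changes $n_e$ very little, which is why Lap-5, with five alleles, has a smaller effective number than Adk-1, with four.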
In essence, the neutrality hypothesis states that many mutations have so little effect on the organism that their influence on survival and reproduction is negligible. The frequencies of neutral alleles are not, therefore, determined by natural selection. Consequently, if the neutrality hypothesis is true, then many polymorphisms may have no particular significance in the adaptation of a species to its environment. From the perspective of adaptation, selectively neutral polymorphisms are mere evolutionary "noise" and, regardless of how much their study may reveal about population structure and random genetic drift, they tell us little or nothing about adaptive genetic changes in evolution. Kimura (1968) gave the irony a positive spin by noting that "if my chief conclusion [about the prevalence of neutral alleles] is correct, then we must recognize the great importance of random genetic drift ... in forming the genetic structure of biological populations." Quite so. Indeed, while neutral alleles are unsuitable for the study of genetic adaptation, the very fact that they are invisible to natural selection makes them ideal for mapping the geographical structure of populations and for tracing the ancestral lineages of DNA sequences to make inferences about the phylogenetic relationships between species.

Because the neutrality hypothesis is of fundamental importance in population genetics and evolution, it has been a subject of considerable discussion. The neutrality hypothesis was put forward in the late 1960s at a time when most of the genome was supposed to have a protein-coding function. Introns and other noncoding sequences were unknown. Today it is clear that only about 4 percent of the mammalian genome codes for proteins.
The low coding density affords ample scope for mutations that have little or no effect on fitness, including some (but by no means all) mutations in introns, pseudogenes, spacers between genes, noncoding DNA in the centromeric region of chromosomes, and so forth.

There is still considerable controversy whether amino acid polymorphisms are selectively neutral or nearly neutral. To assess the plausibility of the neutrality hypothesis, many aspects of the model must be compared with the situation in actual populations. One aspect of the hypothesis developed in the preceding section concerns the homozygosity to be expected with the infinite-alleles model. Using an observed allozyme homozygosity, we can estimate the effective number of alleles $n_e$ and, from the expression $n_e = 4Nu + 1$, estimate the corresponding value of $Nu$. If the resulting values are grossly unreasonable, we can safely reject the infinite-alleles version of the neutrality hypothesis (or at least argue that actual populations cannot be in equilibrium). Recall from Chapter 2 that observed values of heterozygosity of allozyme genes range from 0.04 to 0.14 in most organisms (see Figure 2.9). Observed homozygosities therefore range from $1 - 0.04 = 0.96$ to $1 - 0.14 = 0.86$, which corresponds to estimated $n_e$ in the range $1/0.96 = 1.04$ to $1/0.86 = 1.16$. Estimates of $Nu$, calculated as $(n_e - 1)/4$, therefore range from 0.01 to 0.04. The fact that the maximum estimated value of $Nu$ differs from the minimum by a factor of only about four is surprising, inasmuch as the population number in different species ranges over a factor of $10^4$ or more. The apparently too uniform distribution of allozyme homozygosities among diverse organisms has been interpreted as implying that the neutrality hypothesis is wrong for amino acid polymorphisms.
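The arithmetic behind these estimates of Nu can be sketched in a few lines. This is a minimal illustration under the equilibrium infinite-alleles relation $1 - F = 4Nu/(4Nu + 1)$; the function names and the parameter name `theta` (standing for $4Nu$) are assumptions made for the example.

```python
def equilibrium_heterozygosity(theta):
    """Equilibrium heterozygosity 1 - F = 4Nu / (4Nu + 1), with theta = 4Nu."""
    return theta / (theta + 1.0)


def nu_from_heterozygosity(h):
    """Invert the relation: n_e = 1 / (1 - H), then Nu = (n_e - 1) / 4."""
    n_e = 1.0 / (1.0 - h)
    return (n_e - 1.0) / 4.0


# Range of allozyme heterozygosity quoted in the text
for h in (0.04, 0.14):
    n_e = 1.0 / (1.0 - h)
    print(f"H = {h:.2f}: n_e = {n_e:.2f}, Nu = {nu_from_heterozygosity(h):.3f}")

ratio = nu_from_heterozygosity(0.14) / nu_from_heterozygosity(0.04)
print(f"ratio of largest to smallest Nu = {ratio:.1f}")
```

The ratio comes out close to four, which is the narrow spread the text contrasts with the much wider range of population numbers among species.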
On the other hand, estimates of the population number in natural populations are generally imprecise because the studies are very difficult, and estimates of $u$, which in this case is the mutation rate to neutral alleles, are even more uncertain.

Figure 5.8A shows a second type of test of the adequacy of the neutrality hypothesis. The shaded histogram is the observed distribution of allozyme heterozygosity. The histogram outlined in solid lines is a computer-generated theoretical distribution expected with the infinite-alleles model. The observed average heterozygosity is 0.099, and the theoretical heterozygosity is 0.091. The correspondence between the histograms is fairly good, but the observed distribution seems to include too many genes with heterozygosities in the range of 0.35 to 0.55. (For a possible explanation, see Fuerst et al. 1977.)

Figure 5.8 (A) Observed distribution of allozyme heterozygosity (shaded) along with the theoretical distribution for selective neutrality (solid lines). (B) Mean and variance of heterozygosity among vertebrate species (mammals, birds, fish, lizards, and amphibians). The solid line is the theoretical expectation for the infinite-alleles model in which the mutation rate to neutral alleles varies among genes in such a manner that the variance in mutation rate equals the square of the mean mutation rate. (After Nei et al. 1976.)

A third type of test of the neutrality hypothesis is shown in Figure 5.8B, which presents data on the mean and variance of heterozygosity in 77 vertebrate species. The curve is the theoretical expectation from the infinite-alleles model when the rate of selectively neutral mutation varies among genes (Nei et al. 1976). At first glance, the fit in Figure 5.8B is impressive.
On the other hand, the observed points are sufficiently scattered that any number of other curves might fit at least as well. Evidently, statistical comparisons of this sort are too lacking in power to distinguish between the hypotheses.

A brief consideration of the phrase lacking in power may be in order. The neutral theory is useful in being a sort of starting point, or null hypothesis, which provides predictions about the relationships among observed quantities that can be confirmed or rejected. Statistical tests of the neutral theory are similar to other types of statistical tests in that two distinct types of possible errors must be balanced. If the tests are too demanding (for example, in failing to allow for the effects of random sampling error), then data may often result in rejection of the hypothesis even when it is true. False rejection is called Type I error. On the other hand, if the statistical test allows too much latitude in the data, then data will seldom result in rejection of the hypothesis even when it is false. False acceptance is called Type II error. The tradeoff between Type I error and Type II error is that the probability of Type I error cannot be decreased without increasing the probability of Type II error, and vice versa. By convention, statisticians usually adopt a 5 percent criterion for rejection of the null hypothesis even when it is true. This is the familiar 5% level of statistical significance, and it means that there is a 5% chance of rejecting a true hypothesis (Type I error). With this convention, the probability of a Type II error (failing to reject a false hypothesis) falls where it may, and a test with a relatively high probability of Type II error is said to be lacking in power.

Although the comparisons in Figure 5.8 are lacking in power and hence are inconclusive in their support of the neutrality hypothesis, many other observations and types of data have been brought to bear in assessing the hypothesis.
These data often rely on comparison of nucleotide sequences of DNA in different genes or in different species. These types of comparisons and the conclusions from them are discussed further in Chapter 7.

LINKAGE AND RECOMBINATION

In the context of genetic variation, the importance of recombination is that it allows linked alleles to become associated in many different combinations. In a random mating diploid population, as discussed in Chapter 3, linked alleles come into random association (linkage equilibrium) at a rate determined by the frequency of recombination $r$ (Equation 3.8). If $r$ is small, it may require many generations for linkage equilibrium to be attained. For example, the average rate of recombination between adjacent nucleotides in Drosophila is $2.7 \times 10^{-8}$, with wide variation in different parts of the genome, and so nucleotide polymorphisms in the same region of the genome are often in linkage disequilibrium. Consequently, the ultimate fate of a new mutation may depend to a considerable extent on the effects of other polymorphisms with which it is very closely linked. The effect of recombination on the fate of genetic variation is the subject of this section.

Presumed Evolutionary Benefit of Recombination

Evolutionary biologists have long taken it for granted that recombination is important in evolution because it accelerates the rate of formation of beneficial gene combinations. A graphical representation of the process is illustrated in Figure 5.9. In part A are two large populations, one with no recombination (an asexual species) and one with recombination (a sexual species). Each has three favorable mutations, a, b, and c, which ultimately become incorporated into the genome. In the asexual species, the mutations are incorporated sequentially because each favorable mutation must take place in the genetic background of the one before.
The process is slow because each favorable mutation must be nearly fixed before there is a high chance that the next favorable mutation takes place in the proper genetic background. In contrast, in the sexual population, there is no such problem. Recombination between the genes allows the triple mutant abc to be formed almost immediately.

The evolutionary advantage of recombination outlined in Figure 5.9A does not apply as strongly to the small populations in Figure 5.9B. In a small population, three favorable mutations are unlikely to be present simultaneously, and so the fixation of the favorable alleles proceeds sequentially in a sexual as well as in an asexual species.

Figure 5.9 Evolutionary effect of recombination, with time on the horizontal axis. (A) Large population. In a large population of an asexual species with no recombination (top panel), the favorable mutations a, b, and c must be incorporated into the genome sequentially because there is no mechanism to bring the favorable mutations together; each favored mutation must reach a high frequency to have a reasonable chance that the next favorable mutation will take place in the proper genetic background. With recombination (bottom panel), recombination between the favorable genes enables the triple mutant abc to be formed very rapidly. (B) Small population. The beneficial effect of recombination is diminished in a very small population because, in a small population, multiple favorable mutations are unlikely to be present simultaneously. (From Crow and Kimura 1970.)

Recombination and Polymorphism

Because recombination between adjacent nucleotides is infrequent, nearby nucleotide sites tend to evolve together. Owing to genetic linkage, forces that tend to maintain genetic diversity or that tend to reduce genetic diversity will act regionally. Therefore, the level of polymorphism found in any region of the genome is expected to be correlated with the level of polymorphism in a closely linked region. Evolutionary forces thus leave their mark on the level and type of genetic variation found within closely linked regions of the genome.

In D. melanogaster, an important pattern of genetic polymorphism associated with degree of linkage is illustrated in Figure 5.10. A region of the genome in which the rate of recombination per nucleotide is reduced, such as near the tip or near the base of each chromosome arm, also tends to have a reduced level of genetic polymorphism even though the rates of mutation are uniform across the chromosome (Aquadro et al. 1994). In Figure 5.10, the level of polymorphism is expressed as the proportion of nucleotide sites that are polymorphic (called $\theta$ in Chapter 2). For the regions plotted, $\theta$ ranges over more than a factor of 10, so there is clearly an important effect of close linkage in reducing the level of polymorphism.

Figure 5.10 Observed relation between the level of nucleotide polymorphism and the rate of recombination in Drosophila. (From Aquadro et al. 1994.)

In theory, the reduction in the level of polymorphism in regions of tight linkage could be explained by either of two diametrically opposed mechanisms. In one mechanism, the reduction results from the fixation of favorable mutations. In the other mechanism, the reduction results from the elimination of harmful mutations. These explanations have somewhat different implications for the pattern of polymorphism in regions of tight linkage, and so they can be distinguished experimentally.

Consider first the consequences of fixation of a favorable mutation. On its way to fixation, any new favorable mutation may carry along a small surrounding region of the genome and render the region monomorphic. The monomorphism will not usually be complete.
Some degree of polymorphism may remain in the region, either because new mutations happen in the process of fixation or because of rare recombination events that take place.

The process in which a favorable mutation becomes fixed in a population is called a selective sweep. During a selective sweep of a favorable allele, any neutral alleles sufficiently tightly linked go along for the ride and are said to be hitchhiking. The main effect of hitchhiking is that a small region around the favored allele will be overrepresented in the population. In other words, there will be an apparent excess of rare genetic variants owing to the overrepresentation of the region that profited from the hitchhiking.

Consider next the consequences of a harmful mutation. For concreteness, consider the genetic map diagrammed in Figure 5.11A, in which the short vertical lines indicate adjacent nucleotide sites. One site that can undergo neutral mutation is embedded in the middle surrounded by sites that can undergo harmful mutations only. The rate of harmful mutation per site per generation is denoted $u$, and the rate of recombination between adjacent sites is denoted $r$.

Figure 5.11 Effects of background selection on nucleotide polymorphism. (A) A region of a chromosome containing a set of genes (tick marks) that can mutate to detrimental alleles; within this set of genes is a single neutral site. The mutation rate per locus is $u$ and the rate of recombination between adjacent loci is $r$, with $U = \sum u$ and $R = \sum r$. (B) Relative nucleotide diversity as a function of $U$, the total mutation rate, and $R$, the total recombination rate (recombination frequency across the region). Note the positive correlation between level of nucleotide polymorphism and rate of recombination.
Suppose further that each mutation, even when heterozygous, is sufficiently harmful that any chromosome in which a mutation is present is ultimately doomed. In the absence of recombination, the fate of a chromosome depends on whether it is free of harmful mutations because, under our assumptions, no chromosome can persist for long unless it is free of mutations. The effect of harmful mutation, which in this context is called background selection, is to reduce the number of chromosomes that can contribute to the ancestry of remote generations. Indeed, the effect of background selection is identical to that of a reduction in population size except that the reduction applies, not to the genome as a whole, but to a tightly linked region (Charlesworth et al. 1993). Background selection therefore reduces the level of genetic polymorphism. Looser linkage means that a linked neutral mutation can escape the fate of a harmful neighboring mutation by recombination with a mutation-free chromosome. Hence, the tighter the linkage, the greater the reduction in polymorphism due to background selection. Although there is a reduction in the level of polymorphism, background selection does not skew the distribution of rare polymorphisms because, for all practical purposes, the harmful allele merely causes one chromosome to drop out of the population, much as if it were to go extinct by chance (Braverman et al. 1995).

Although the evidence is not yet conclusive, the model of background selection appears to provide a better explanation of the Drosophila data than does the model of selective sweeps (Hudson and Kaplan 1995; Charlesworth et al. 1995). The evidence is that rare nucleotide polymorphisms are found at a frequency that would be expected given the overall level of polymorphism (Braverman et al. 1995). There is no evidence for a skewed distribution toward rare variants that the model of selective sweeps would predict.
The effect of background selection on the level of genetic variation is shown graphically in Figure 5.11B for the genetic map diagrammed in part A. The curves are plotted from the formula

$$\pi = \pi_0 e^{-U/(2hs + R)} \qquad (5.11)$$

(Hudson and Kaplan 1995). The symbol $\pi$ is the nucleotide diversity, defined as the average proportion of nucleotide differences between all possible pairs of sequences (Chapter 2); $\pi_0$ is the value of $\pi$ in the absence of background selection. $U$ and $R$ refer to the diagram in part A. $U$ is the total mutation rate per diploid genome, summed across all genes in the region; and $R$ is the total rate of recombination across the region, summed over each of the intervals between genes. The quantity $hs$ measures the degree of harmfulness of each deleterious mutation in a heterozygous genotype; the extremes are $hs = 0$, when there is no effect in the heterozygote, and $hs = 1$, when the heterozygote is lethal. The model on which Equation 5.11 is based includes the assumption that $hs$ is small but not [...]. The curves in Figure 5.11B are for the specific value $hs = 0.02$, which means that a genotype that is heterozygous for one deleterious mutation has a 2% reduction in survival compared with a homozygous nonmutant.

For each curve, the relative nucleotide diversity ($\pi/\pi_0$) decreases as the total recombination rate $R$ decreases. This result means that, with tighter linkage, each detrimental mutation that is eliminated takes with it a larger surrounding region of chromosome. The relative nucleotide diversity also decreases as the total mutation rate increases; that is, greater background selection eliminates a greater number of chromosomes. Together, tight linkage and a moderate or high total mutation rate can result in a very substantial decrease in relative nucleotide diversity, reducing it to a level of 20% or less of that expected in the absence of background selection.
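The dependence of relative nucleotide diversity on linkage and mutation rate can be explored numerically. The sketch below assumes that Equation 5.11 has the form $\pi/\pi_0 = e^{-U/(2hs + R)}$ attributed to Hudson and Kaplan (1995); the function name and the particular values of $U$ and $R$ are illustrative choices and are not read from Figure 5.11B.

```python
import math


def relative_diversity(U, R, hs=0.02):
    """pi / pi_0 = exp(-U / (2*hs + R)), the form of Equation 5.11 assumed here."""
    return math.exp(-U / (2.0 * hs + R))


# Illustrative parameter values only (not taken from the figure)
for U in (0.01, 0.05, 0.1):
    row = ", ".join(
        f"R={R}: {relative_diversity(U, R):.2f}" for R in (0.0, 0.1, 0.2, 0.4)
    )
    print(f"U={U}: {row}")
```

Under this form, diversity rises toward its neutral value as $R$ grows and falls as $U$ grows, and with $U = 0.1$ and no recombination the relative diversity drops below 20%, in line with the qualitative statements in the text.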
In view of the reduction in genetic variation in regions of reduced recombination observed in Drosophila (Figure 5.10), the implication of Equation 5.11, along with the absence of a skewed distribution toward rare variants, suggests that much of the effect results from background selection.

Piecewise Recombination in Bacteria

Many prokaryotic organisms make use of mechanisms of recombination in which a piece of DNA that is small, relative to the size of the entire genome, is transferred from a donor cell into a recipient cell. These mechanisms include transformation, in which free DNA is taken up by the recipient from the surrounding medium; transduction, in which a DNA fragment is carried from the donor to the recipient by means of a virus particle; and conjugation, in which a replica of the chromosome from a donor cell is transferred into a recipient cell by a gradual process requiring cell-to-cell contact, but the chromosome usually breaks before the transfer is complete. Because relatively short patches of the genome participate in recombination, these processes differ in their evolutionary implications from meiotic recombination in eukaryotes.

The main effect of short-patch recombination is that long-range linkage disequilibrium tends to be maintained. For example, in enteric bacteria, such as Escherichia coli, which are part of the normal intestinal flora, linkage disequilibrium between allozyme loci is very strong (Whittam et al. 1983). At the level of DNA sequence, however, many genes have an obviously mosaic structure in which different segments have different phylogenetic histories (DuBose et al. 1988). An example from the phoA gene, coding for alkaline phosphatase in E. coli, is illustrated in Figure 5.12. Among the polymorphic nucleotide sites indicated, the unique nucleotide at each site is inscribed in a box.
At the extreme ends of the gene, the alleles from strains RM217T and RM45E are the most closely related; in the middle of the gene, from nucleotide sites 1425 to 1560, there is a run of polymorphic nucleotides in which the similarity between RM217T and RM45E is lost, as if this part of the gene had been introduced by recombination with a more distantly related allele. Although short runs of similar or dissimilar nucleotides can also be the result of chance, chance effects can be ruled out by appropriate statistical tests for recombination (Stephens 1985; Sawyer 1989).

Figure 5.12 Evidence for recombination in the phoA gene in natural isolates of E. coli. The pair of strains at the top are more similar at the beginning and end of the gene; the pair of strains at the bottom are more similar in the central region. There is significant clustering of the nucleotide sites inscribed in boxes, as expected from recombination. (Data from DuBose et al. 1988.)

The finding that many genes have a mosaic ancestry through recombination seems at first to contradict the finding of significant linkage disequilibrium between more widely separated genes. The paradox is resolved by the fact that each recombination event is local; it replaces a relatively short stretch of the recipient chromosome, and the linkage phase between more distant alleles is maintained. The E. coli chromosome, therefore, consists of clonal segments from a common ancestor, which is called the clonal frame (Milkman and Bridges 1990, 1993), interrupted by short segments derived from recombination with diverse other clones.
Even though the clonal frames are interrupted by relatively short recombinant segments, their integrity would ultimately be lost unless there were occasional selective events favoring particular genotypes.

Absence of Recombination in Animal Mitochondrial DNA

Studies in animal population genetics often focus on the DNA of mitochondria. The mitochondrial genome is informative about parentage because in most species of animals, it is maternally inherited and does not undergo recombination. It is also a small molecule present in abundant quantities in most cells. In animals, mitochondrial DNA (mtDNA) is a circular molecule typically in the range from 15 to 20 thousand base pairs in length. It codes for fewer than 40 genes; approximately half code for ribosomal RNA or for transfer RNA used in mitochondrial protein synthesis, and the remaining genes code for proteins used in electron transport or oxidative phosphorylation. In many species, including mammals, parts of the mtDNA sequence evolve very rapidly in comparison with nuclear genes, and hence mtDNA can often be used to make inferences about population structure and recent population history.

An example of the utility of mtDNA in population studies is illustrated in Figure 5.13, which summarizes the result of examining the mtDNA of 87 pocket gophers, Geomys pinetis, collected across the geographic range of the species in Alabama, Georgia, and Florida (Avise et al. 1979). The mtDNA from each gopher was digested in turn with each of six restriction enzymes, each cleaving the DNA at a different six-base recognition site.

Figure 5.13 Lineage relationships between mtDNA types in pocket gophers. The lowercase letters are different mtDNA types grouped according to similarity and superimposed on a geographical map of the collection sites. The tick marks across the connecting lines are the numbers of inferred mutational steps. (From Avise 1994.)
The resulting restriction fragments were separated by electrophoresis and compared among the animals to estimate the number of nucleotide differences affecting the restriction sites.

Among the 87 gophers, there were 23 distinct types of mtDNA, represented by the lowercase letters in Figure 5.13. Each of these types represents a maternal mtDNA lineage, distinct from other lineages. Animals that share an mtDNA type must have a female ancestor in common. The branching network in Figure 5.13 estimates the matriarchal phylogeny of the mtDNA. The straight lines connect related types of mtDNA, and the number of slashes across each line indicates the estimated number of nucleotide differences in the restriction sites between the mtDNA types. Groups of related mtDNA types are enclosed in thin black lines; the thickest lines delineate a western and an eastern subpopulation of gophers whose overall mtDNA sequence differs by an estimated 3%. Between the eastern and western subpopulations, there are 9 nucleotide differences among the sites cleaved by the restriction enzymes.

The mtDNA network in Figure 5.13 also resolves population subdivision within the western and eastern subpopulations. This subdivision is indicated by the mtDNA types circumscribed by the thin black lines. Some of the mtDNA types such as "k" and "p" are widespread, whereas others such as "b" and "q" are more local in their distribution. The local clones usually differ from the most widespread mtDNA type in the region by only one or two nucleotides among the sites cleaved by the restriction enzymes. The example in Figure 5.13 shows that, because of matrilineal inheritance and the absence of recombination in mtDNA, the network of mtDNA types can reveal a great deal about population substructure in natural populations.

MIGRATION

In a subdivided population, random genetic drift results in genetic divergence among subpopulations.
Migration, which refers to the movement of organisms among subpopulations, is a sort of genetic glue that holds subpopulations together genetically and that sets a limit to how much genetic divergence can take place. To understand the homogenizing effects of migration, it is useful to study migration in several simple models of population structure.

One-Way Migration

When migration takes place predominantly from one population into another, without an equal amount of migration in the reverse direction, then there is said to be one-way migration. An illustration of one-way migration between a large mainland population and a small island subpopulation is shown in Figure 5.14. For simplicity, we consider a gene with two alleles, A and a, with respective frequencies $p^*$ and $q^*$ on the mainland and $p$ and $q$ on the island. Suppose that, in any generation, a proportion $m$ of zygotes in the island subpopulation originates as a random sample of organisms from the mainland. Then, if $p$ and $p'$ are the frequencies of A in the island subpopulation in two successive generations, it follows that

$$p' = (1 - m)p + mp^* \qquad (5.12)$$

In Equation 5.12, $m$ is called the migration rate between the mainland and the island.

Figure 5.14 Model of one-way migration from a large land mass (the mainland, where the allele frequency of A is $p^*$ and the allele frequency of a is $q^*$) onto an island. The allele frequencies in the source population, $p^*$ and $q^*$, are assumed to remain constant, whereas those in the recipient population, $p_t$ and $q_t$, change with time.
Subtracting $p^*$ from both sides of Equation 5.12 and simplifying leads to the expression $p' - p^* = (1 - m)(p - p^*)$; from this expression it follows immediately that $p_t - p^* = (1 - m)^t(p_0 - p^*)$, where $p_t$ is the frequency of A in the island subpopulation in generation $t$. Hence,

$$p_t = p^* + (1 - m)^t(p_0 - p^*) \qquad (5.13)$$

Equation 5.13 expresses mathematically what should be clear intuitively: With one-way migration, the allele frequency of A in the island subpopulation gradually approaches that of the mainland population, and the rate of approach is $m$ per generation. As a check on Equation 5.13, note that, when $t = 0$, then $p_t = p_0$, as must be the case, and as $t$ becomes large, $p_t \to p^*$.

As an evolutionary process that brings potentially new alleles into a population, migration is qualitatively similar to mutation. The major difference is quantitative: Generally speaking, the rate of migration among subpopulations of a species is vastly greater than the rate of mutation of a gene. The contrast is illustrated in Figure 5.15 for the unrealistic case in which the A allele present in an island subpopulation is absent on the mainland. In this case, Equation 5.13 becomes $p_t = p_0(1 - m)^t$, which has the same form as Equation 5.1 for one-way mutation except that $m$ replaces $u$.

Figure 5.15 Change of allele frequency with one-way migration assuming that an allele A is initially fixed in the recipient population and absent in the source population. The horizontal axis is time ($t$, in generations). The migration rate is $m = 0.01$. Note that this is the same curve as in Figure 5.1 except that the horizontal axis is compressed to 500 generations. The time scale is different because, generally speaking, the migration rate $m$ is much larger than the mutation rate $u$.
The identity in the shape of the curves is apparent, but the time axis in Figure 5.15 is compressed because, when $m = 0.01$, as in this example, compared with the value of $u = 0.0001$ in Figure 5.1, it requires only one generation of migration to change the allele frequency to the same extent as 100 generations of mutation.

Equation 5.13 holds more generally for one-way migration by letting $p$ be the frequency of any allele in the population that receives the migrants and $p^*$ be the frequency of the same allele in the population that supplies the migrants. Application of this equation to estimating the amount of genetic migration in certain human populations makes use of the allele-frequency data given in Problem 4.4 (page 126). The data pertain to blacks and whites in Claxton, Georgia, and blacks in West Africa. The case of the MN blood groups serves as an example. In West Africa, which for the purpose of this problem may be regarded as the ancestral black population, $p_0 = 0.474$ for the allele frequency of M. In present-day Claxton blacks, $p_t = 0.484$. The Claxton white population may reasonably be regarded as representative of the source of the migrants, and for Claxton whites, $p^* = 0.507$. Blacks came into the United States on a large scale from West Africa about 300 years ago, hence $t$ is about 10 generations. Substituting these estimates into Equation 5.13, we obtain $0.484 = 0.507 + (1 - m)^{10}(0.474 - 0.507)$, from which we infer that $m = 0.035$ per generation. This estimate can be interpreted as implying that, in the genetic history of the population of Claxton blacks, about 3.5% of the alleles of the MN gene in any generation were newly introduced by genetic migration from whites. The apparent amount of migration estimated by this method differs from one locus to the next. It also differs according to the geographical region in which the white and black populations reside.
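The estimate of $m$ for the MN blood groups follows from solving Equation 5.13 for $m$, which gives $m = 1 - [(p_t - p^*)/(p_0 - p^*)]^{1/t}$. The sketch below is a minimal illustration using the frequencies quoted above; the function names are hypothetical, and the guard against a negative ratio is a choice made for the example rather than part of the text.

```python
def island_frequency(p0, p_star, m, t):
    """Equation 5.13: p_t = p* + (1 - m)^t (p0 - p*)."""
    return p_star + (1.0 - m) ** t * (p0 - p_star)


def estimate_migration_rate(p0, pt, p_star, t):
    """Solve Equation 5.13 for m given p0, p_t, p* and the number of generations t."""
    ratio = (pt - p_star) / (p0 - p_star)
    if ratio < 0:
        # p_t has moved past p*, which one-way migration alone cannot produce
        raise ValueError("p_t lies beyond p*; no real solution for m")
    return 1.0 - ratio ** (1.0 / t)


# MN blood group frequencies quoted in the text, with t = 10 generations
m = estimate_migration_rate(p0=0.474, pt=0.484, p_star=0.507, t=10)
print(f"m = {m:.3f}")
```

When the observed frequency has moved away from the source frequency rather than toward it, the ratio exceeds 1 and the estimate of $m$ comes out negative, which is how estimates that are inconsistent with the model arise.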
PROBLEM 5.4 Estimate the amount of migration from whites to blacks using allele frequencies for each of the other genes in Problem 4.4 (page 126).

ANSWER Ss blood group: m = -0.013 per generation; Duffy: m = 0.011; Kidd: m = -0.028; Kell: m = -0.005; G6PD: m = 0.039; hemoglobin β: m = 0.071.

Problem 5.4 illustrates some of the difficulties in estimating racial admixture from allele frequencies. The positive values of m vary widely, and the negative values are not consistent with the proposed model of migration. Cavalli-Sforza and Bodmer (1971) remark that "The weakness of the analysis is mostly due to the uncertainty of the origin of black Americans ... and the variability of gene frequencies in the probable area of the slave markets in West Africa. In addition, it is unavoidable that gene frequencies have changed somewhat from their original values, due to drift or, in some cases, selection. The opportunities for admixture, and the time available for it, must also have varied widely." The most reliable gene among those in Problem 5.4 is probably that for the Duffy blood groups because the $Fy^a$ allele is virtually nonexistent in all of West Africa. For this gene, the estimate of m is about one percent per generation, a result that is consistent with the average value for a large number of other genes (Cavalli-Sforza and Bodmer 1971).

The Island Model of Migration

In the island model of migration, a large population is split into many subpopulations dispersed geographically like islands in an archipelago. Examples of island population structure might include fish in freshwater lakes or slugs in dispersed garden plots. Each subpopulation is assumed to be so large that random genetic drift can be neglected. Consider an allele A with an average allele frequency among the subpopulations equal to $\bar{p}$.
Migration is assumed to happen in such a way that the allele frequency among the migrants equals the average allele frequency among the subpopulations, namely, $\bar{p}$. The amount of migration is again measured by the parameter $m$, which equals the probability that a randomly chosen allele in any subpopulation comes from a migrant. Let us consider a particular subpopulation with an A allele frequency of $p_t$ in generation $t$. For a randomly chosen allele in this subpopulation in generation $t$, the allele could have come from the same subpopulation in generation $t - 1$ with probability $1 - m$, in which case it is an A allele with probability $p_{t-1}$. Alternatively, the allele could have come from a migrant in generation $t - 1$ with probability $m$, in which case it is an A allele with probability $\bar{p}$. Because all evolutionary processes other than migration are ignored, $\bar{p}$ stays the same in all generations. Altogether,

$$p_t = p_{t-1}(1 - m) + \bar{p}m \qquad (5.14)$$

Equation 5.14 is similar to Equation 5.2 for mutation, and its solution in terms of $p_0$ is

$$p_t = \bar{p} + (1 - m)^t (p_0 - \bar{p}) \qquad (5.15)$$

The similarity with Equation 5.13 is apparent: in fact, the equations are identical except that the role of $p^*$ in one-way migration is replaced with $\bar{p}$ in the island model. Perhaps less obvious is the similarity with Equation 5.4 for reversible mutation, in which case $v/(u + v)$ plays the role of $\bar{p}$ and $u + v$ plays the role of $m$. The correspondence between the equations again emphasizes the similarity between the effects of migration and those of mutation. The processes result in similar mathematical expressions because both mutation and migration act linearly on allele frequency, which means that $p_t$ is a linear function of $p_{t-1}$. Although Equation 5.15 for migration is mathematically similar to Equation 5.4 for mutation, the biological implications are quite different. Because rates of migration are typically much greater than rates of mutation, changes in allele frequency are generally much faster with migration.
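The statement that the average frequency stays constant while each subpopulation converges on it can be illustrated with a small Python sketch. It assumes equal-sized subpopulations, so that the migrant pool is the unweighted mean of the current frequencies; the starting frequencies and function names are illustrative choices, not taken from the text.

```python
def island_model_step(freqs, m):
    """Apply Equation 5.14 once to every subpopulation (equal sizes assumed)."""
    p_bar = sum(freqs) / len(freqs)  # allele frequency among the migrants
    return [(1.0 - m) * p + m * p_bar for p in freqs]


def island_model_closed_form(p0, p_bar, m, t):
    """Equation 5.15: frequency in one subpopulation after t generations."""
    return p_bar + (1.0 - m) ** t * (p0 - p_bar)


if __name__ == "__main__":
    start = [0.2, 0.8]  # illustrative starting frequencies
    m, generations = 0.10, 10
    p_bar = sum(start) / len(start)

    freqs = list(start)
    for _ in range(generations):
        freqs = island_model_step(freqs, m)

    for p0, stepped in zip(start, freqs):
        closed = island_model_closed_form(p0, p_bar, m, generations)
        print(p0, round(stepped, 3), round(closed, 3))
    print("mean after migration:", round(sum(freqs) / len(freqs), 3))
```

Stepping the recursion and evaluating the closed form should give the same frequencies, and the mean across subpopulations should be unchanged by migration.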
As an example of the use of Equation 5.15, suppose there are only two populations with initial allele frequencies of A of 0.2 and 0.8, respectively, with $m = 0.10$. Thus 10 percent of the organisms in either subpopulation in any generation are migrants having an allele frequency of A of $\bar{p} = (0.2 + 0.8)/2 = 0.5$. What is the allele frequency of A in the two populations after 10 generations? For the population with initial allele frequency 0.2, we substitute $p_0 = 0.2$, $\bar{p} = 0.5$, and $m = 0.10$ into Equation 5.15 to obtain $p_{10} = 0.5 + (1 - 0.10)^{10}(0.2 - 0.5) = 0.395$; for the other population, we substitute $p_0 = 0.8$, $\bar{p} = 0.5$, and $m = 0.10$, and so $p_{10} = 0.5 + (1 - 0.10)^{10}(0.8 - 0.5) = 0.605$.

Another example using Equation 5.15 is shown in Figure 5.16, where there are five subpopulations (initial frequencies 1, 0.75, 0.50, 0.25, and 0), again with $m = 0.10$. Note how rapidly the allele frequencies converge to the same value, in this case, 0.5.

Figure 5.16 Change of allele frequency with time ($t$, in generations) in five subpopulations exchanging migrants at the rate $m = 0.1$ per generation; the equilibrium frequency is $\bar{p}$. Note the rapid convergence to a common equilibrium frequency.

How Migration Limits Genetic Divergence

It is remarkable how little migration is required to prevent significant genetic divergence among subpopulations as measured by, for example, the fixation index $F_{ST}$. To understand the homogenizing effect of migration, consider the model in Figure 5.5 (page 171), in which two alleles drawn at random from a subpopulation in generation $t$ are replicas of the same allele in generation $t - 1$ with probability $1/2N$ and replicas of different alleles in generation $t - 1$ with probability $1 - 1/2N$. In the first case, the alleles are necessarily identical by descent; in the second case, they are identical by descent with probability $F_{t-1}$, where $F$ is shorthand for $F_{ST}$.
In either case, the identity by descent is unbroken only if neither allele is replaced by an allele from a migrant, and so

$$F_t = \left(\frac{1}{2N}\right)(1 - m)^2 + \left(1 - \frac{1}{2N}\right)(1 - m)^2 F_{t-1} \qquad (5.16)$$

Illustrating again the analogy between migration and mutation, Equation 5.16 is identical to Equation 5.8 measuring the effect of mutation on the probability of identity by descent, except that $m$ replaces $\mu$. The equilibrium value $\hat{F}$ of $F$ can be found by setting $\hat{F} = F_t = F_{t-1}$; after expanding the squared terms on the right-hand side, and assuming that $m$ is small enough, and $N$ large enough, that terms in $m^2$ and $m/N$ can be ignored, some rearrangement leads to

$$\hat{F} = \frac{1}{1 + 4Nm} \qquad (5.17)$$

As might be expected, Equation 5.17 is identical in form to Equation 5.9 for mutation, but the biological implications are very different owing to the fact that the rate of migration is typically much greater than the rate of mutation.

The product $Nm$ in Equation 5.17 has a straightforward biological interpretation. The total number of alleles in a subpopulation of size $N$ diploid organisms is $2N$. In any generation, the proportion of alleles that are replaced by alleles from migrant organisms is $m$; hence the number of migrant alleles in any generation equals $2Nm$. However, $2Nm$ is also the total number of alleles in $Nm$ diploid organisms, and so $Nm$ can be interpreted as the absolute number of migrant organisms that come into each subpopulation in each generation. Because the absolute number of migrants per generation equals $Nm$, Equation 5.17 implies that $\hat{F}$ decreases as the number of migrants increases. Indeed, the decrease in $\hat{F}$ with increasing $Nm$ is extremely rapid, as shown in Figure 5.17.
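The quality of the approximation behind Equation 5.17 can be examined numerically. The Python sketch below compares the approximate equilibrium 1/(1 + 4Nm) with the value reached by iterating the recursion, which is assumed here to have the form F_t = (1 - m)^2 [1/(2N) + (1 - 1/(2N)) F_(t-1)]; that form is a reading of Equation 5.16, and the function names and the values of N and m are illustrative.

```python
def equilibrium_f_approx(n_m):
    """Approximate equilibrium fixation index, F = 1 / (1 + 4Nm) (Equation 5.17)."""
    return 1.0 / (1.0 + 4.0 * n_m)


def equilibrium_f_by_iteration(n, m, generations=20_000):
    """Iterate the assumed recursion for F_t until it has effectively converged."""
    f = 0.0
    retained = (1.0 - m) ** 2  # probability that neither allele is replaced by a migrant
    for _ in range(generations):
        f = retained * (1.0 / (2.0 * n) + (1.0 - 1.0 / (2.0 * n)) * f)
    return f


def migrants_from_f(f):
    """Invert Equation 5.17 to obtain Nm from an equilibrium value of F."""
    return (1.0 / f - 1.0) / 4.0


if __name__ == "__main__":
    for n_m in (0.25, 0.5, 1.0, 2.0):
        print(n_m, round(equilibrium_f_approx(n_m), 2))
    # N = 1000 and m = 0.001 give Nm = 1; compare the iterated value with the approximation.
    print(round(equilibrium_f_by_iteration(1000, 0.001), 4), equilibrium_f_approx(1.0))
```

With small m and large N the iterated value should sit very close to the approximation, which is the condition under which the terms in m^2 and m/N were dropped.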
In the extreme case of complete genetic isolation between the subpopulations, $Nm = 0$ and $\hat{F} = 1$. The decrease is then so rapid that for:

• $Nm = 0.25$ (one migrant every fourth generation), $\hat{F} = 0.50$
• $Nm = 0.5$ (one migrant every second generation), $\hat{F} = 0.33$
• $Nm = 1$ (one migrant every generation), $\hat{F} = 0.20$
• $Nm = 2$ (two migrants every generation), $\hat{F} = 0.11$

The implication of Figure 5.17 is that migration is a potent force acting against genetic divergence among subpopulations. On the other hand, the homogenizing effect of migration should not be overestimated. The measure of genetic divergence in Figure 5.17 is $F_{ST}$, the value of which is determined by the variance in allele frequency among subpopulations (Equation 4.6) and so is affected primarily by polymorphic alleles that are at intermediate frequencies. Rare alleles present in one subpopulation but absent in others have hardly any effect on $F_{ST}$. Because rare alleles are rare, they are unlikely to be included among migrant organisms unless the migration rate is very great, and so rare alleles will tend to remain present in only one or a few subpopulations in a local area until such time as their frequency may become great enough to be dispersed by migration. An allele found in only one subpopulation is called a private allele. Next we shall see that the rate of migration can be estimated by an examination of the frequency of private alleles.

Figure 5.17 Decrease in the fixation index $F_{ST}$ among subpopulations at equilibrium in the island model of migration, plotted against the number of migrant organisms per generation. The curve is that in Equation 5.17 giving $\hat{F}$ as a function of $Nm$.
In the island model, $Nm$ is the number of migrant organisms that come into each subpopulation in each generation.

Estimates of Migration Rates

One method of estimating genetic migration in natural populations relies on the finding that, in theoretical models, the logarithm of $Nm$ decreases approximately as a linear function of the average frequency of private alleles in samples from the subpopulations (Slatkin 1985). Data on the average frequency of private alleles have been compiled and analyzed by Slatkin (1985), and the resulting estimates of $Nm$ and equilibrium values of $F_{ST}$ are summarized in Table 5.1. There is obviously considerable variation in $Nm$ among organisms. However, many of the values of $Nm$ are smaller than about 2, which means that there is still considerable opportunity for genetic divergence among subpopulations.

A second kind of approach to estimating $Nm$ in natural populations is illustrated in Figure 5.18, which gives the distribution of estimated values of $F_{ST}$ among 61 genes in natural populations of Drosophila melanogaster (Singh and Rhomberg 1987). The average of the estimated values is $F_{ST} = 0.16$, which, assuming equilibrium, is an estimate of $1/(1 + 4Nm)$ (Equation 5.17). The estimate is therefore $Nm = [(1/0.16) - 1]/4 = 1.3$. This estimate is within the range for other Drosophila species in Table 5.1. However, there are many genes in Figure 5.18 that have $F_{ST}$ values greater than 0.30. An analogous method of estimating $Nm$ from the $F_{ST}$ values of polymorphic nucleotides within a gene is discussed in Hudson et al. (1994a). In Chapter 7 we will consider how $Nm$ can be estimated from the genealogies of genes.

TABLE 5.1 ESTIMATES OF Nm AND F_ST

Species                          Type of organism   Estimated Nm   Estimated F_ST
Stephanomeria exigua             Annual plant       1.4            0.152
[name not legible]               Mollusc            42             0.006
Drosophila willistoni            Insect             9.9            0.025
Drosophila pseudoobscura         Insect             1.0            0.200
Chanos chanos                    Fish               4.2            0.056
Hyla regilla                     Frog               1.4            0.152
Plethodon ouachitae              Salamander         2.1            0.106
Plethodon cinereus               Salamander         0.22           0.532
Plethodon dorsalis               Salamander         0.10           0.714
Batrachoseps pacifica ssp. 1     Salamander         0.64           0.281
Batrachoseps pacifica ssp. 2     Salamander         0.20           0.556
Batrachoseps campi               Salamander         0.16           0.610
Lacerta melisellensis            Lizard             1.9            0.116
Peromyscus californicus          Mouse              2.2            0.102
Peromyscus polionotus            Mouse              0.31           0.446
Thomomys bottae                  Gopher             0.86           0.225

Source: Data from Slatkin 1985.

Figure 5.18 Distribution of estimated values of $F_{ST}$ for 61 genes among natural populations of Drosophila melanogaster. Although the average value of $F_{ST}$ suggests migration at a level of $Nm$ between 1 and 2, about one-third of the genes have $F_{ST}$ values greater than 0.20. (From Singh and Rhomberg 1987.)

Patterns of Migration

Migration in actual populations is more complex than is assumed in the island model of migration. In nature, migrants come primarily from nearby populations. To the extent that nearby populations have similar allele frequencies, the effects of migration are smaller, and sometimes much smaller, than predicted by the island model. Populations in nature may be strung out along one dimension, such as a river bank. Populations may also be distributed regularly in two dimensions, or there may be one large population with an internal genetic structure caused by the tendency for mating to take place between organisms born in the same region. Analysis of the effects of migration in such complex population structures is usually very difficult. Among humans, migration rates depend on age, sex, marital status, socioeconomic status, population density, and many other factors. Migration rates also can change rapidly, and so a full-blown theory of migration has to be extremely complex.

The effects of migration on genetic differentiation of populations are seen dramatically in Figure 5.19.
Part A pertains to the moth Biston betularia, part B to the moth Gonodontis bidentata. Both species have evolved melanic (blackened) forms in response to heavy air pollution, and the graphs give the frequency of the melanic forms in the two species. The geographical area in A includes Liverpool and Manchester, as viewed from rural Wales. Note the fall-off in frequency of melanics in the nonindustrial areas toward the front of the graph. Biston betularia exists in low population densities and must fly relatively long distances to find a mate. The resulting high rate of migration hinders differentiation of populations, hence the smooth surface. In contrast, Gonodontis bidentata exists in high population densities and the migration rate is low; hence there is substantial genetic differentiation among populations, as evidenced by the bumpy surface of the graph in part B.

TRANSPOSABLE ELEMENTS

A DNA sequence that can change its location within the genome is called a transposable element. In being able to create novel genome rearrangements, transposable elements are agents of genetic variation. A transposable element may insert into a coding region and inactivate a gene or insert into a regulatory region and change the pattern of expression of the gene. Also, pairs of transposable elements may undergo recombination and create novel chromosome rearrangements.

The process of transposition requires a protein, called transposase, which is usually encoded within the sequence of the transposable element itself. Most transposable elements undergo transposition through a replicative process with DNA or RNA intermediates. In most cases, transposition to a new location also leaves one copy of the transposable element behind in its original location, so transposable elements can increase in copy number in the genome. Some transposable elements are also able to regulate their own rate of transposition.
Several major classes of transposable elements can be distinguished by their nucleotide sequence organization or by the details of their mechanisms of transposition or regulation.

Figure 5.19 (A) Distribution of melanic moths of the species Biston betularia over an area including Liverpool and Manchester, as viewed from rural Wales. (B) Distribution of melanic moths of the species Gonodontis bidentata over a smaller area than in (A) but viewed from the same perspective. (From Bishop and Cook 1975.)

Factors Controlling the Population Dynamics of Transposable Elements

Transposable elements were originally discovered in maize as the cause of certain genetically unstable mutations. They are now known to be ubiquitous among prokaryotes and eukaryotes (Berg and Howe 1989). The ability of transposable elements to increase in copy number and create novel chromosomal rearrangements reveals a dynamic aspect of genome structure and evolution not previously recognized. Some transposable elements have become widely disseminated among organisms because of their ability to undergo horizontal transmission between reproductively isolated genomes. Often referred to as selfish DNA because transposition alone may be sufficient for persistence in the genome of a species, transposable elements also may occasionally create favorable mutations and thus become agents of adaptive evolution.

Models for the population dynamics of transposable elements usually incorporate several features.

• A rate of infection, in which genomes previously lacking the transposable element become infected with it.
• A rate of transposition, which determines how rapidly the copy number increases; the effects of regulation are taken into account by assuming that the rate of transposition is a decreasing function of copy number.
• A mechanism, or combination of mechanisms, for eliminating elements from the population; otherwise, the copy number would increase indefinitely. The usual assumption is that the presence of transposable elements in the genome decreases the ability of an organism to survive and reproduce, resulting in the elimination of some elements by means of natural selection, or that elements can be eliminated from the genome by means of genetic deletion.

Through the study of such models, the diversity and novel attributes of transposable elements have been incorporated into the concepts of population genetics; see, for example, Langley et al. (1983), Montgomery and Langley (1983), Kaplan and Brookfield (1983), Sawyer et al. (1987), Hartl and Sawyer (1988), Ajioka and Hartl (1989), and Charlesworth et al. (1994).

Insertion Sequences and Composite Transposons in Bacteria

Bacteria contain several types of transposable elements. Among the simplest are insertion sequences, which are typically about 1000-2000 nucleotides in length and contain at least one long translational open reading frame coding for the transposase protein. The transposase recognizes a short nucleotide sequence, inverted in orientation, present at each end of the insertion sequence, and so the element moves as an intact unit. The bacterium Escherichia coli contains several types of insertion sequences, each different but all sharing the same sequence organization with inverted repeats and at least one open reading frame. The factors controlling the population dynamics of insertion sequences can be deduced from the distribution of numbers of each element present among a sample of bacterial strains isolated from natural sources (Sawyer et al. 1987).

Population models of transposable elements in E. coli are greatly simplified because the organism has asexual reproduction, a low rate of recombination among strains, and a low rate of deletion of insertion sequences. The "state" of a bacterial strain with respect to a particular insertion sequence may be defined as the number of copies $n$ of the element that are present. Among the factors that control the population dynamics are:

• The rate $u$ at which uninfected cells become infected; $u$ is the probability, per generation, that a cell initially in state $n = 0$ ends up in the state $n = 1$.
• The rate $T$ of transposition in infected strains; $T$ is the probability, per generation, that a cell in state $n > 0$ goes to state $n + 1$.
• The rate $S$ at which reproduction of infected cells is less than that of uninfected cells. In terms of the exponential growth model in Chapter 1, if $r_0$ is the intrinsic rate of increase of uninfected cells (see Equation 1.7 on page 30) and $r_n$ is that of infected cells, then $S = r_0 - r_n$.

The most general models of this type allow for $T$ and $S$ to be functions of $n$, but here we will assume that they are constant. Note, however, that the assumption that $T$ is a constant implicitly defines a type of regulation because, if the probability of transition from state $n$ to state $n + 1$ is independent of $n$, then the probability of transposition per element present in a strain must equal $T/n$, and this fraction is a decreasing function of $n$.

Given constant values of $u$, $T$, and $S$, it can be shown that a population of bacterial cells attains an equilibrium distribution of numbers of transposable elements in which the probability $p_i$ that a cell contains exactly $i$ copies of the transposable element is equal to

$$p_0 = a \qquad (5.18a)$$

and

$$p_i = (1 - a)(1 - \phi)\phi^{i-1} \quad (i > 0) \qquad (5.18b)$$

where $a = 1 - (u/S)$ and $\phi = T/(T + S - u)$ (Sawyer and Hartl 1986, Sawyer et al. 1987).

Equation 5.18 can be applied to the concrete case of insertion sequence IS30 in E. coli, in which the distribution of numbers among 71 strains fits a model with $a = 1/2$ and $\phi = 1/2$. With these parameters, the distribution simplifies to the remarkably simple formula $p_i = (1/2)^{i+1}$ for $i \ge 0$. Among 71 strains, therefore, the observed and expected numbers of strains containing $i$ elements are as indicated in Table 5.2. The strains with five or more elements have been grouped in order to carry out a $\chi^2$ test of goodness of fit. This $\chi^2$ test has three degrees of freedom because $a$ and $\phi$ were estimated from the data. The value of $\chi^2$ equals 3.48, which has an associated probability level of about 0.35. Thus, the simple model for IS30 fits the observed data very well. Although the $\chi^2$ test cannot be completely trusted in this case because of the small expected numbers in some of the categories, the conclusion is supported by a more exact statistical test (Sawyer et al. 1987). The following problem deals with the distribution of three other insertion sequences in E. coli.

TABLE 5.2 NUMBER OF IS30 ELEMENTS PRESENT IN 71 NATURAL ISOLATES OF E. coli

Number of copies of IS30 element   0      1      2     3     4     ≥5
Expected number of strains         35.5   17.8   8.9   4.4   2.2   2.2
Observed number of strains         36     16     13    2     2     2

Source: Data from Sawyer et al. 1987.

PROBLEM 5.5 The distribution of IS1 fits Equation 5.18 with $a = 1/5$ and $\phi = 5/6$; IS2 fits the equation with $a = 2/5$ and $\phi = 2/3$; and IS4 fits with $a = 2/3$ and $\phi = 3/4$. Calculate the expected numbers for 71 strains and carry out a $\chi^2$ test. (The observed numbers are from Sawyer et al. 1987.)

No. copies   0    1    2    3   4   ≥5
IS1          11   14   8    6   7   25
IS2          28   8    12   5   5   13
IS4          43   5    5    3   5   10

ANSWER For IS1, the expected distribution is given by $p_0 = 1/5$, $p_i = (4/25)(5/6)^i$ for $1 \le i \le 4$, and $p_{\ge 5} = 1 - (p_0 + p_1 + p_2 + p_3 + p_4)$. For IS2, the expected distribution is $p_0 = 2/5$ and $p_i = (3/10)(2/3)^i$ ($1 \le i \le 4$). For IS4, the expected distribution is $p_0 = 2/3$ and $p_i = (1/9)(3/4)^i$ ($1 \le i \le 4$).
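The expected numbers and goodness-of-fit statistics for these insertion sequences can be reproduced with a short Python sketch. It assumes that Equation 5.18 reads p_0 = a and p_i = (1 - a)(1 - φ)φ^(i-1) for i > 0, and it uses the parameter values and observed counts as read from Table 5.2 and Problem 5.5; the function names are hypothetical.

```python
def copy_number_distribution(a, phi, max_copies=5):
    """Probabilities [p_0, ..., p_(max_copies-1), p_(>=max_copies)] under the assumed Equation 5.18."""
    probs = [a]
    probs += [(1.0 - a) * (1.0 - phi) * phi ** (i - 1) for i in range(1, max_copies)]
    probs.append(1.0 - sum(probs))  # pooled class of max_copies or more elements
    return probs


def chi_square(observed, expected):
    """Pearson goodness-of-fit statistic, summed over copy-number classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fit_statistic(a, phi, observed):
    """Chi-square for observed strain counts against the expected counts."""
    n_strains = sum(observed)
    expected = [n_strains * p for p in copy_number_distribution(a, phi, len(observed) - 1)]
    return chi_square(observed, expected)


if __name__ == "__main__":
    # (a, phi, observed counts for 0, 1, 2, 3, 4, >=5 copies), values as read from the text
    fits = {
        "IS30": (1 / 2, 1 / 2, [36, 16, 13, 2, 2, 2]),
        "IS1": (1 / 5, 5 / 6, [11, 14, 8, 6, 7, 25]),
        "IS2": (2 / 5, 2 / 3, [28, 8, 12, 5, 5, 13]),
        "IS4": (2 / 3, 3 / 4, [43, 5, 5, 3, 5, 10]),
    }
    for name, (a, phi, observed) in fits.items():
        expected = [71 * p for p in copy_number_distribution(a, phi)]
        print(name, [round(e, 1) for e in expected], round(fit_statistic(a, phi, observed), 2))
```

Pooling the classes with five or more copies mirrors the grouping used for the test in the text; with six classes and two fitted parameters, each statistic has three degrees of freedom.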
Expected numbers, $\chi^2$ values, and associated probabilities are:

No. copies   0      1      2     3     4     ≥5     χ²     P value
IS1          14.2   9.5    7.9   6.6   5.5   27.4   3.58   0.35
IS2          28.4   14.2   9.5   6.3   4.2   8.4    6.31   0.10
IS4          47.3   5.9    4.4   3.3   2.5   7.5    4.00   0.28

As in the case of IS30, more exact statistical tests confirm the conclusion that the model fits. However, the distribution of IS1 has a very long tail, with nine strains containing from 15 to 20 copies and six strains containing from 21 to 30 copies; this distribution is approximated even more closely by a model in which the regulation of transposition decreases more gradually than $T/n$ (Sawyer et al. 1987).

Apart from their own evolutionary dynamics, insertion sequences are important because they can mobilize other sequences in the genome. When two copies of an insertion sequence are on flanking sides of an unrelated sequence, the inverted repeats used in transposition are preferentially those at the extreme ends. This kind of insertion-sequence sandwich constitutes a composite transposable element or transposon, which transposes as a single unit. In a composite transposon, the central sequence can include one or more genes that confer a selective advantage on the host cell, such as a gene for resistance to an antibiotic; hence, the possession of the transposon would be favored in an environment containing the antibiotic.

Mobilization of genes for antibiotic resistance, heavy-metal resistance, and other functions is one of the principal evolutionary implications of transposable elements in bacteria. Transposable elements enable the piecewise assembly of specialized, infectious molecules called plasmids. Plasmids are autonomously replicating, circular molecules of DNA that exist within bacterial cells. Many plasmids contain genes that promote their transfer between different organisms. They may also contain genes, such as those for antibiotic resistance, that are highly advantageous to their hosts in certain environments.
These genes are often contained in transposons, and they undoubtedly entered the plasmid through transposition from a different plasmid or from the genome of a previous host. Infectious plasmids containing multiple antibiotic-resistance genes are called resistance transfer factors, and they are a major source of multiple drug resistance in pathogenic bacteria.

Transposable Elements in Eukaryotes

Transposable elements can have important genetic consequences as mutagenic agents by the creation of novel genes, by alteration of the expression of genes in their vicinity, and in the genesis of major genomic rearrangements. Transposable elements also have important implications in population genetics and evolution. Several major classes of transposable elements have been identified that differ in the molecular mechanisms of transposition. Within each class, the members can also differ in DNA sequence. Based on similarity in DNA sequence, transposable elements typically can be grouped hierarchically into "subfamilies," in which the elements resemble each other quite closely; "families," in which they differ from one another somewhat more; and "superfamilies," in which the differences are relatively great.

Transposable elements are widespread in both animals and plants. For example, Drosophila melanogaster contains multiple copies of each of 50 to 100 different families of transposable elements (Rubin 1983). Although few of these elements have been studied in detail from the standpoint of population genetics, indirect evidence suggests that most of the elements, like insertion sequences in bacteria, are mildly harmful to the host (Golding et al. 1986, Lohe et al. 1995).

Horizontal Transmission of Transposable Elements

Among the most widespread families of transposable elements is that of the mariner-like elements (MLEs), typified by the transposable element mariner. The molecular organization of the mariner element is illustrated in Figure 5.20A.
The element is flanked by short (28 base pair) inverted repeats (IR) and includes a long open reading frame coding for the transposase protein (Hartl 1989). Insertion of the element is invariably adjacent to a 5'-TA-3' dinucleotide in the host genome and is accompanied by a duplication of the dinucleotide, so that the inserted mariner is flanked by 5'-TA-3'. The target sequence and dinucleotide, as well as features of the amino acid sequence of the transposase protein, identify a transposable element as an MLE.

MLEs are widely distributed among insects and other invertebrates (Robertson 1993; Robertson and MacLeod 1993). Figure 5.20B shows the distribution among species in the major insect orders (Coleoptera, Diptera, and so forth). The number of copies of an MLE per genome varies widely among species, ranging from a few copies to many thousands. The MLEs in Figure 5.20B have been grouped according to similarity in nucleotide sequence and arranged in the form of a tree with the root to the left and the tips of the branches to the right. There are several subfamilies of insect MLEs, denoted mauritiana, cecropia, honeybee, and so forth. MLEs in different subfamilies are typically 40 to 50% identical in nucleotide sequence, and those within the same subfamily are usually 60% or more identical. All of the insect MLEs are more closely related to each other than they are to an MLE found in the soil nematode Caenorhabditis elegans.

Figure 5.20 (A) The molecular organization of the transposable element mariner showing the inverted repeats (IR) flanking the transposase-coding region. (B) Distribution of MLEs among species representing major insect orders (numbered: 1 Coleoptera, 2 Diptera, 3 Hemiptera, 4 Hymenoptera, 5 Lepidoptera, 6 Thysanura; Other). Note that the MLEs can be grouped into subfamilies of elements (mauritiana, cecropia, and so forth) based on their similarity in sequence. C. elegans is the soil nematode Caenorhabditis elegans. (Data for B from Robertson 1993.)

Although MLEs are widespread, their distribution is "spotty," which means that, among closely related species, a particular type of MLE may be found in some species but not in others. Furthermore:

• Any species may contain MLEs from two or more different subfamilies.
• Closely related MLEs are often found in distantly related species.

An example of the second principle is an MLE found in Drosophila erecta, a close relative of D. melanogaster, which is 97% identical in nucleotide sequence with an MLE found in the cat flea Ctenocephalides felis (Lohe et al. 1995). For comparison, a gene coding for a subunit of the cellular sodium pump sequenced in both species shows only 39% nucleotide identity at third codon positions.

What process can account for the virtual identity between MLEs in species as distantly related as a Drosophila and a cat flea? One possibility is that the MLE was present in the common ancestor of the species a few hundred million years ago and then virtually stopped evolving, so that the sequences remain almost identical today. Unless the nucleotide sequence is very highly constrained, including third codon positions, this is a very unlikely possibility. Furthermore, if MLE sequences are so constrained, then why is there so much sequence variability within and among subfamilies?

More likely than evolution stopping dead in its tracks for several hundred million years is the hypothesis of horizontal transmission, or the ability of an MLE to be transferred from a host species into the germline of a different, reproductively isolated species. To account for the D. erecta-C. felis case by horizontal transmission, an MLE would have to have been transmitted from a D. erecta ancestor to a C. felis ancestor (or the other way around) approximately 3 to 10 million years ago.
Many additional examples of horizontal transmission of MLEs and other eukaryotic transposable elements have been discovered. Although the process of horizontal transmission certainly takes place, the rate at which it happens and the vectors and mechanisms are as yet unknown.

Once introduced into a genome, MLEs can persist through multiple speciation events (Maruyama and Hartl 1991). A lineage can, however, lose an MLE, as evidenced by D. melanogaster, which has lost an MLE (the mariner element itself) present in all its closest relatives. Two processes appear to contribute to loss of an MLE: (1) mutational inactivation, which may destroy the protein-coding function of an MLE or impair its ability to transpose; and (2) stochastic loss, by which we mean the elimination of an MLE from the genome as a result of random genetic drift. There might possibly also be a contribution from natural selection, depending on the extent to which presence of the MLE itself is deleterious. From the standpoint of the host species, an inactivating mutation in an MLE may be selectively neutral, or perhaps even favorable, inasmuch as natural selection may act to minimize the harmful mutagenic effects of transposition. Subsequent mutations in an already inactivated MLE are presumably selectively neutral and ultimately lost by chance. The role of mutational inactivation and stochastic loss in the evolutionary dynamics of MLEs is supported by the spotty distribution of MLEs among closely related species.

SUMMARY

Mutation provides the raw material for evolutionary change but, by itself, mutation pressure is a very weak force for changing allele frequency. If allele A mutates to allele a at a rate $u$ per generation, and a undergoes reverse mutation at a rate $v$ per generation, then the equilibrium frequency of A is $v/(u + v)$, but the population may require tens of thousands or hundreds of thousands of generations to reach equilibrium.
In the infinite-alleles model, the equilibrium value of $F_{ST}$ for neutral alleles is given by 1/(4Nu + 1), where u is the mutation rate to selectively neutral alleles; 4Nu + 1 is called the effective number of alleles. For a neutral allele, the probability of ultimate fixation equals the frequency of the allele in the population. Statistical tests of the neutrality hypothesis based on the effective number of allozyme alleles or on the allozyme heterozygosity are inconclusive owing to lack of statistical power.

Recombination allows the formation of beneficial combinations of genes. In Drosophila, there is a positive correlation between the rate of recombination and the level of nucleotide polymorphism: regions of reduced recombination are less polymorphic. The reduced polymorphism could result from selective sweeps of favorable mutations or from background selection against detrimental mutations. In prokaryotes, there is extensive linkage disequilibrium over long genetic distances in spite of the fact that each gene may have a mosaic ancestry owing to intragenic recombination. The apparent paradox results because recombination in prokaryotes usually involves a short stretch of DNA and the process is infrequent. In animal mitochondrial DNA, the absence of recombination enables the identification of mitochondrial lineages.

Migration hinders genetic divergence among subpopulations. In finite populations, the equilibrium value of $F_{ST}$ with migration is given by 1/(4Nm + 1), and only a few migrants per generation are sufficient to keep $F_{ST}$ smaller than about 10%. On the other hand, a small amount of migration is usually not sufficient to disperse rare alleles among subpopulations, and so rare alleles are often unique to one or a few subpopulations.

Transposable elements are ubiquitous in the genomes of all organisms.
Their tendency to increase in copy number through their ability to replicate and transpose is usually offset by the harmful effects of the insertions themselves; hence, there is an equilibrium distribution of copy number among organisms. Some transposable elements have direct or indirect beneficial effects; bacterial transposons that carry genes for antibiotic resistance provide an example. Bacterial transposons are disseminated among organisms and among species by transmission of infectious plasmids in which the transposons may reside. In eukaryotes, horizontal transmission can take place between species in spite of absolute reproductive isolation. Many transposable elements can be grouped into subfamilies, families, and superfamilies based on their degree of nucleotide sequence similarity. The mariner-like elements (MLEs) are exceptionally widespread among insects and other invertebrates. The innate tendency of an MLE to increase in copy number in a genome is offset by mutational inactivation and ultimately stochastic loss. These offsetting processes may explain the spotty distribution of MLEs observed among closely related species.

PROBLEMS

1. Most protein-coding genes have a forward mutation rate (normal to mutant) that is at least an order of magnitude greater than the reverse mutation rate (mutant back to normal). Why should this be the case?

2. A classical bacterial experiment demonstrated that mutations occur at random and not in response to specific selection pressures for them. The experiment used sterilized velvet to imprint the geometrical pattern of bacterial colonies on an agar surface in a petri dish (a "plate"), which was used to replicate the pattern by impressing the velvet on sterile nutrient agar in a selective plate containing an antibiotic.
Colonies on the original plate giving resistant cells on the selective plate were dispersed into single cells, spread onto a nutrient agar plate without antibiotic, and allowed to multiply into colonies. This procedure was repeated until one or more colonies on the unselective media consisted exclusively of antibiotic resistant cells. How does this experiment prove the point?

3. Estimation of mutation rates from bacterial cultures can be tricky because, if a mutation occurs early in the life of a culture, the final frequency will be very high; but if it occurs late, the final frequency will be low. The fluctuation test is a method for getting around this problem by growing many smaller cultures and estimating the mutation rate from the proportion of cultures that contain no mutations using the zero term of the Poisson distribution $P_0 = \exp(-uN)$, where $P_0$ is the proportion of cultures with no mutations, u is the mutation rate, and N is the average number of cells per culture. In one experiment for bacteriophage T1 resistance, 11/20 cultures contained no mutations and the average number of cells per culture was $5.6 \times 10^8$. Estimate u.

4. If recessive lethals occur independently in Drosophila autosomes, and the probability that an autosome contains one or more recessive lethals is 0.35 (a typical figure for chromosomes isolated from natural populations), what is the average number of recessive lethals per chromosome? Assume that the distribution of lethals is Poisson so that the probability of a chromosome containing exactly i lethals is $P_i = (m^i/i!)\exp(-m)$, where m is the mean.

5. The doubling dose of radiation is the quantity of radiation that induces as many mutations as occur spontaneously, so the total mutation rate of organisms exposed to the doubling dose equals two times the spontaneous mutation rate.
Below are the induction rates per rad of x-rays (a standard measure of dose) for various genetic end points in irradiated male mice, along with the spontaneous rates. What are the corresponding doubling doses?

Induction rate/rad Spontaneous rate Dominant lethals 5 x l(r' /gamete Recessive visihles 7 x lO"*/ focus Reciprocal translocations 1 to 2 x ItrVcell 2 to IOxHr 2 /gamcfc Hxld V locus 2 to 5 x 10" '/cell

6. For irreversible mutation with a forward mutation rate u = 5 x 10"", calculate the allele frequency p after 10, 100, 1000, and 10000 generations, assuming $p_0$ = 1.0.

7. If a transposable genetic element becomes fixed at a particular site but undergoes deletion at the rate of one percent per generation, how many generations are required to decrease the frequency of the element at the site to 90%?

8. The following data give the frequency q of bacteria resistant to a bacteriophage after t generations of chemostat growth. At t = 12 hours a novel metabolite was added to the medium.
a. What is the basal rate of mutation to resistance?
b. What is the effect of the novel metabolite on the mutation rate?

    t     q                        t     q
    0     $1 \times 10^{-6}$       16    $7.04 \times 10^{-6}$
    4     $3 \times 10^{-6}$       20    $7.08 \times 10^{-6}$
    8     $5 \times 10^{-6}$       24    $7.12 \times 10^{-6}$
    12    $7 \times 10^{-6}$

9. In the forward and reverse mutation model, what is the equilibrium frequency $\hat{p}$ of A if
a. u = 10^ and v = 10" f '?
b. u is increased tenfold?
c. v is increased tenfold?
d. both are increased tenfold?

10. In the forward and reverse mutation model, show that the time required for the allele frequency to go halfway to equilibrium is approximately t = 0.7/(u + v) generations. Use the approximation that ln(1 - x) ≈ -x when x is small. What time is required to go halfway to equilibrium when u = $10^{-5}$ and v = 10*?

11. In the irreversible mutation model, what is the frequency $q_t$ of allele a in generation t if the mutation rate changes from generation to generation?
If the equation q, = q f u/ is applied to this situation, what value corresponds to p?

12. Suppose a gene has eight alleles at frequencies 0.55, 0.20, 0.09, 0.06, 0.04, 0.03, 0.02, and 0.01. What is the effective number of alleles? What would the effective number be if each allele had a frequency of 0.125?

13. Why is the effective number of alleles essentially independent of the number of rare alleles?

14. What is the equilibrium heterozygosity in a population of effective size 50 if new neutral mutations are introduced at a rate $10^{-5}$ by mutation and at a rate $10^{-3}$ by migration?

15. If the average number of alleles of a gene is 1 + x per diploid individual, where 0 < x < 1, then what is the heterozygosity? (Note that one diploid individual is a random sample of two alleles.)

16. Calculate the autozygosity F after 200 generations in a random mating population of effective size N = 50.

17. In an isolated random mating population of effective size N, how many generations of random genetic drift are required to produce the same average inbreeding coefficient F as obtained in one generation of brother-sister mating (for which F = 1/4)? Use the approximation $[1 - 1/(2N)]^t \approx \exp(-t/2N)$.

18. If a mainland population of snails has an allele frequency of 0.8 and an island population has a frequency of 0.2, how many generations are required for the island population to achieve an allele frequency of 0.5, given a migration rate of 01?

19. If four populations with allele frequencies 0.2, 0.4, 0.6, and 0.8 undergo migration according to the island model with m = 0.05, what are the expected allele frequencies after 10 generations?

20. In the island model of migration, how does the variance in allele frequency among populations change as a function of m and t?

21. When random genetic drift is offset by migration among populations in the island model, what value of m is necessary to keep the equilibrium value of F smaller than 0.05?
CHAPTER 6

Darwinian Selection

Natural Selection • Fitness • Haploid Models • Diploid Models • Mutation-Selection Balance • Complex Modes of Selection • Kin Selection • Interdeme Selection

Thus far in this book, the term natural selection has been used in the informal, intuitive sense used by Darwin in The Origin of Species (1859):

Owing to this struggle for life, variations, however slight and from whatever cause proceeding, if they be in any degree profitable to the individuals of a species, in their infinitely complex relations to other organic beings and to their physical conditions of life, will tend to the preservation of such individuals, and will generally be inherited by the offspring. The offspring, also, will thus have a better chance of surviving, for, of the many individuals of any species which are periodically born, but a small number can survive. I have called this principle, by which each slight variation, if useful, is preserved, by the term Natural Selection.

Modern formulations of natural selection are less literary and usually compacted into a form resembling a logical syllogism:

• In all species, more offspring are produced than can possibly survive and reproduce.
• Organisms differ in their ability to survive and reproduce, in part owing to differences in genotype.
• In every generation, genotypes that promote survival in the current environment are present in excess at the reproductive age and thus contribute disproportionately to the offspring of the next generation.

Through natural selection, therefore, alleles that enhance survival and reproduction increase gradually in frequency from generation to generation, and the population becomes progressively better able to survive and reproduce in the environment. The progressive genetic improvement in populations resulting from natural selection constitutes the process of evolutionary adaptation.
In the brief description of natural selection quoted above, Darwin uses the term individual three times. The unit of selection is the individual organism: not the species, not the subpopulation, not the sibship. It is the performance of the individual organism that matters. Each individual organism competes in the struggle for existence and survives or perishes on its own. Darwin also used the terms "struggle for existence" and "survival of the fittest" as synonyms for natural selection, but he emphasized that he employed the terms in their widest metaphorical sense to include not only the life of the organism but also the success of the organism in leaving progeny: fecundity is as important as survival.

In this chapter, we shall see how Darwin's concept of "survival of the fittest" of individual organisms has been made more formal and quantitative and incorporated into models describing the change in allele frequency under natural selection. These models show that natural selection acts simultaneously on different components of fitness and can operate at different levels of population structure.

SELECTION IN HAPLOID ORGANISMS

Selection acts on the phenotype, not on the genotype, and the total phenotype is determined by many genes that interact with each other as well as with numerous environmental factors. However, in exploring the consequences of selection, it is convenient to focus on changes in the frequency of the alleles of a single gene. We shall begin by examining selection in its simplest form operating in a haploid, asexual organism, such as a species of bacteria. In haploids, selection is realized as differential population growth; hence we shall make reference to the discrete and continuous models of population growth examined in Chapter 1.
The overall process of selection is identical whether population growth is in discrete or continuous generations, but the models have a somewhat different parameterization and it is necessary to relate the models to avoid confusion later.

Discrete Generations

Consider two bacterial genotypes, A and B, that reproduce asexually. For simplicity, we will assume the discrete model of population growth discussed in Chapter 1 and we set a and b equal to the rates of population growth of A and B, respectively. Equation 1.5 implies that $A_t = (1 + a)^t A_0$ and $B_t = (1 + b)^t B_0$, where $A_t$ and $B_t$ are the number of cells of genotype A and genotype B, respectively, at time t. Selection takes place when a ≠ b. Figure 6.1A is an example in which the growth rates of A and B are a = 0.04 and b = 0.05, respectively. Both populations increase in size exponentially, but that of B increases faster than that of A.

In most cases, we are not interested in the actual number of A cells or B cells but in the proportion of all cells that are of type A. Equivalently, we can examine the ratio of the number of A cells to that of B cells at time t, which is given by

$$\frac{A_t}{B_t} = \left(\frac{1 + a}{1 + b}\right)^t \frac{A_0}{B_0} = w^t \frac{A_0}{B_0} \qquad (6.1)$$

Figure 6.1 (A) Discrete population growth of two hypothetical bacterial strains, A and B, in which the growth rates are 4% per generation for A and 5% per generation for B. Population size is plotted every second generation against time (t, in generations). The initial numbers are $1.6 \times 10^5$ for A and $0.4 \times 10^5$ for B. (B) Ratio of A to B against time (t, in generations). Because the B population grows faster than the A population, the proportion of A in the total population decreases.

The outcome of selection is determined by the ratio of a to b because if a < b, then the ratio of A cells to B cells decreases until, ultimately, A is lost;
conversely, if a > b, then the ratio of A cells to B cells increases without limit. Figure 6.1B shows the change in A/B for the example in part A. From a value of 4 at the beginning, the ratio declines to a value of 1.54 in 100 generations; these ratios correspond to frequencies of A of 0.80 and 0.61, respectively.

In the selection in Figure 6.1, it is not necessary to specify whether a and b differ because of survivorship or fecundity. All that matters is that they do differ. It is also important that the outcome depends only on the ratio (1 + a)/(1 + b), which means that, in practice, we do not need to know the absolute growth rates of A and B but only their relative values (their ratio). In Equation 6.1, w represents the ratio (1 + a)/(1 + b). The symbol w is conventionally used in discrete models of selection and, in this example, it is the relative fitness of genotype A to that of genotype B. In other words, in a haploid organism, the relative fitness equals the ratio of the growth rates.

Although it is sometimes instructive to do so, it is not necessary to keep track of population size in models of selection. The variable of interest is usually the allele frequency and not the population size. Therefore, let $p_t$ and $q_t$ represent the frequencies of genotypes A and B, respectively, in generation t, with $p_t + q_t = 1$. A method to relate the frequencies of A and B in any two successive generations is illustrated in Table 6.1. For ease of discussion, we divide each generation into three phases: birth, selection, and reproduction. In generation t - 1, the frequencies of A and B at birth are $p_{t-1}$ and $q_{t-1}$, respectively. The genotypes A and B are assumed to survive in the ratio w : 1, which means that w is the probability of survival of an A genotype relative to that of a B genotype. As before, the absolute probabilities of survival of the genotypes are not relevant. All that matters is the ratio. After selection, the ratio of frequencies of A : B equals $p_{t-1} \times w : q_{t-1} \times 1$. If the surviving genotypes reproduce with equal efficiency, then the frequencies at birth in the following generation are given by the expressions across the bottom in Table 6.1; the denominators in these expressions are necessary to make the allele frequencies in generation t sum to 1.

TABLE 6.1 A MODEL OF SELECTION IN A HAPLOID ORGANISM, IN WHICH w IS THE PROBABILITY OF SURVIVAL OF AN A CELL RELATIVE TO THAT OF A B CELL

                                Genotype A                                        Genotype B
Generation t - 1
  Frequency before selection    $p_{t-1}$                                         $q_{t-1}$
  Relative fitness              $w$                                               $1$
  After selection               $p_{t-1} w$                                       $q_{t-1}$
Generation t                    $p_t = \dfrac{p_{t-1} w}{p_{t-1} w + q_{t-1}}$    $q_t = \dfrac{q_{t-1}}{p_{t-1} w + q_{t-1}}$

Note: The fractions in the bottom line are expressions for the allele frequencies in generation t in terms of those in generation t - 1. Although this model assumes differential survival, w could also be the relative probability of reproduction of A and B. More generally, the relative fitness w : 1 represents the net output of A : B for the combined effects of differential survival and reproduction.

For comparison with Equation 6.1, consider that $p_t$ is the number of A cells in generation t divided by the total; likewise, $q_t$ is the number of B cells divided by the total. Therefore, the ratio $p_t/q_t$ equals the ratio of A cells to B cells in generation t because the denominators cancel. The expressions in Table 6.1 imply that the ratio of p/q in any generation equals w multiplied by the ratio of p/q in the previous generation, and so

$$\frac{p_t}{q_t} = w\,\frac{p_{t-1}}{q_{t-1}} = w^2\,\frac{p_{t-2}}{q_{t-2}} = \cdots = w^t\,\frac{p_0}{q_0} \qquad (6.2)$$

The right-hand side of Equation 6.2 is identical to that in Equation 6.1 except that the relative frequencies p and q replace the absolute number of cells of type A and type B. Hence, to deduce the outcome of selection, we do not need to keep track of population size.
All we need to know is the relative fitness w and the initial frequencies $p_0$ and $q_0$. For application to experimental data, Equation 6.2 is often transformed by taking the logarithm:

$$\log\left(\frac{p_t}{q_t}\right) = \log\left(\frac{p_0}{q_0}\right) + t \log w \qquad (6.3)$$

Equation 6.3 means, for example, that if the values of $p_t/q_t$ are monitored in an experimental population of bacteria over the course of time, then a plot of $\log(p_t/q_t)$ against time (in generations) should yield a straight line with slope equal to log w. This kind of experiment is examined in the following problem.

PROBLEM 6.1 In the intestinal bacterium E. coli, the gene gnd codes for the enzyme 6-phosphogluconate dehydrogenase (6PGD), which is used in the metabolism of gluconate but not in the metabolism of ribose. The data below were obtained in experiments in which otherwise genetically identical strains containing the alleles gnd(RM77C) and gnd(RM43A) were grown in competition in chemostats in which the sole source of carbon and energy was either gluconate or ribose (Hartl and Dykhuizen 1981). These gnd alleles are polymorphic in natural populations and code for allozymes of 6PGD. Gluconate is the experimental condition to ascertain the effects on fitness of the gnd alleles, and ribose is the control. In the table, $p_t$ denotes the frequency of the strain containing gnd(RM43A) after t generations of competition. From the two points under each growth condition, estimate the fitness of the strain containing gnd(RM43A) relative to that containing gnd(RM77C) under the growth condition:

    Growth medium    $p_0$    $p_{35}$
    Gluconate        0.455    0.898
    Ribose           0.594    0.587

ANSWER In gluconate medium, log (0.898/0.102) = log (0.455/0.545) + 35 × log w, and so log w = 0.0292, or w = 1.0696. Hence, the allele gnd(RM43A) confers about a 7% selective advantage in competition for utilization of gluconate. In ribose medium, w = 0.999, a value that is not significantly different from 1.0, and so the alleles appear to be functionally equivalent in this environment.
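The two-point estimate in the answer follows directly from Equation 6.3: the slope of log(p/q) against time is log w. The Python sketch below is only an illustration under that assumption; the function name `relative_fitness` is ours and does not come from the text, and the only data used are the four frequencies quoted in the problem.

```python
import math

def relative_fitness(p0, pt, generations):
    """Estimate w from two frequency measurements using Equation 6.3:
    log(pt/qt) = log(p0/q0) + t * log(w), with q = 1 - p."""
    log_ratio_start = math.log(p0 / (1.0 - p0))
    log_ratio_end = math.log(pt / (1.0 - pt))
    log_w = (log_ratio_end - log_ratio_start) / generations
    return math.exp(log_w)

# Frequencies of the gnd(RM43A) strain at generations 0 and 35 (Problem 6.1)
w_gluconate = relative_fitness(0.455, 0.898, 35)
w_ribose = relative_fitness(0.594, 0.587, 35)

print(f"gluconate: w = {w_gluconate:.4f}")
print(f"ribose:    w = {w_ribose:.4f}")
```

Natural logarithms are used here rather than the base-10 logarithms of the worked answer; the base cancels when the slope is converted back to w, so the estimates should agree with the quoted values of about 1.07 and 0.999.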
(There were more than two points in the original data, and the estimates of fitness were based on the slope of the linear regression; here we have quoted only two data points for computational convenience.)

Continuous Time

Bacterial populations such as those in Problem 6.1 do not reproduce in discrete generations but instead they reproduce continuously. In a continuous model, the exponential population growth of A and B are governed by the equations dA(t)/dt = a'A(t) and dB(t)/dt = b'B(t), where a' and b' are the growth rates. Therefore, $A(t) = A(0)\exp(a't)$ and $B(t) = B(0)\exp(b't)$ (Chapter 1), and so

$$\frac{A(t)}{B(t)} = \frac{A(0)}{B(0)}\,e^{(a' - b')t} = \frac{A(0)}{B(0)}\,e^{mt} \qquad (6.4)$$

Equation 6.4 means that, in a continuous population, the outcome of selection depends on the difference between the exponential growth rates a' - b', which is represented by the symbol m on the right-hand side. The value of m also measures the relative fitness of strain A relative to strain B, but in a continuously reproducing population. Comparing Equation 6.4 with Equation 6.1 yields the relation between m and w:

$$m = \ln w \qquad (6.5)$$

In other words, the relative fitness with continuous growth m equals the natural logarithm of the relative fitness with discrete reproduction w. Selective neutrality means that w = 1 or that m = 0. For the values of w estimated in Problem 6.1, the corresponding values of m are 0.0673 and -0.001, respectively. If w is not too different from 1, then m ≈ w - 1 is a reasonable approximation.

Change in Allele Frequency in Haploids

Although the discrete and continuous models are completely equivalent under the transformation in Equation 6.5, the equations for change in allele frequency look rather different. In the discrete model, the change in the frequency of strain A in generation t is given by the difference $p_t - p_{t-1}$, which can be calculated in terms of $p_{t-1}$ from the formulas in Table 6.1.
The difference $p_t - p_{t-1}$ is usually symbolized Δp and, for simplicity, the subscript t - 1 is suppressed. Using the expressions in Table 6.1 and the fact that q = 1 - p, we obtain

$$\Delta p = \frac{pw}{pw + q} - p = \frac{pq(w - 1)}{pw + q} \qquad (6.6)$$

Not surprisingly, p increases if the relative fitness of A is greater than 1 and decreases if the relative fitness of A is smaller than 1. If the relative fitnesses of A and B are equal, then p does not change, provided that the population size is very large (theoretically, it has to be infinite).

The analog of Equation 6.6 in a continuous model contains the derivative dp/dt in place of Δp. This we can obtain from Equation 6.4 with a little trickery. Because A(t)/B(t) equals p(t)/q(t), the derivative of Equation 6.4 with respect to t must equal the derivative of p(t)/q(t) with respect to t. For simplicity, we will write p and q instead of p(t) and q(t). The derivative of Equation 6.4 with respect to t equals mp/q and the derivative of p/q with respect to t equals $(1/q^2) \times dp/dt$. Setting these expressions equal to each other and solving for dp/dt, we obtain

$$\frac{dp}{dt} = mpq \qquad (6.7)$$

Voila! There is no denominator! What happened to it? In a technical sense, it disappeared into the difference between the discrete model and the continuous model. In a practical sense, the absence of a denominator in Equation 6.7 greatly simplifies some of the formulas to come, especially those concerned with random genetic drift in Chapter 7. Although they look very different, Equations 6.6 and 6.7 are merely different ways of saying the same thing. In this chapter, we will deal mainly with expressions analogous to Equation 6.6 because they are more easily derived for various types of selection. However, when it is necessary to dispose of a troublesome denominator, we will invoke the continuous model in Equation 6.7 and be rid of it.
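The claim that the discrete and continuous descriptions say the same thing can be checked numerically. The Python sketch below is an illustration only, using the growth rates of the Figure 6.1 example (a = 0.04, b = 0.05, initial frequency of A equal to 0.80); the function names are ours. It iterates the recursion from Table 6.1, evaluates the closed form implied by Equation 6.2, and evaluates the solution of Equation 6.7 with m = ln w (Equation 6.5).

```python
import math

def discrete_step(p, w):
    """One generation of haploid selection (Table 6.1): p' = pw / (pw + q)."""
    q = 1.0 - p
    return p * w / (p * w + q)

def discrete_trajectory(p0, w, generations):
    """Iterate the recursion and return [p_0, p_1, ..., p_t]."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(discrete_step(freqs[-1], w))
    return freqs

def closed_form(p0, w, t):
    """Frequency implied by Equation 6.2: p_t/q_t = w**t * p_0/q_0."""
    ratio = (w ** t) * p0 / (1.0 - p0)
    return ratio / (1.0 + ratio)

def continuous_solution(p0, m, t):
    """Solution of dp/dt = m*p*q (Equation 6.7): p/q grows as exp(m*t)."""
    ratio = math.exp(m * t) * p0 / (1.0 - p0)
    return ratio / (1.0 + ratio)

# Figure 6.1 example: w = (1 + a)/(1 + b) with a = 0.04 and b = 0.05
w = 1.04 / 1.05
m = math.log(w)  # Malthusian counterpart of w (Equation 6.5)
p0 = 0.80
traj = discrete_trajectory(p0, w, 100)

print(traj[100], closed_form(p0, w, 100), continuous_solution(p0, m, 100))
```

All three numbers should coincide up to floating-point error and match the frequency of about 0.61 quoted for generation 100, which is the sense in which the denominator of Equation 6.6 is absorbed by the change from w to m.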
Darwinian Fitness and Malthusian Fitness

The distinction between the fitness parameters in the discrete and continuous models has been incorporated into the terminology of population genetics in the terms Darwinian fitness, which refers to the discrete model, and Malthusian fitness, which refers to the continuous model. The latter is named after Thomas Malthus (1766-1834), whose views on the implications of continued population growth strongly influenced Darwin's thinking on the subject. A Darwinian fitness is conventionally represented by the symbol w, often embellished with a subscript, and Malthusian fitness is conventionally represented by the symbol m. In this book, the term fitness, when used without qualification, will mean Darwinian fitness unless it is clear from the context that some other meaning is intended.

SELECTION IN DIPLOID ORGANISMS

In diploid organisms, the consequences of selection are most conveniently explored under the model of random mating in Chapter 3, but incorporating selection by permitting the fitnesses of the genotypes to differ. Selection is assumed to take place on the diploid genotypes. We shall use the conventional symbols $w_{11}$, $w_{12}$, and $w_{22}$ to represent the Darwinian fitnesses of the genotypes AA, Aa, and aa, respectively. The simplest way to interpret the fitnesses is in terms of survivorship, usually termed viability, which is the probability that a genotype survives from fertilization to reproductive age. If the fitness of each genotype is set equal to its probability of survivorship, then each fitness is an absolute fitness because its value is independent of the fitnesses of the other genotypes. In practice, we usually know only the value of the viability of each genotype relative to that of another genotype chosen as the standard of comparison. When a fitness value is expressed relative to that of another genotype, the fitness is a relative fitness.
The relative fitness of the genotype chosen as the standard of comparison is arbitrarily assigned the value 1.

To consider a specific example, suppose that the genotypes AA, Aa, and aa have probabilities of survival from conception to reproductive age of 0.75, 0.75, and 0.50, respectively. These are the absolute viabilities of the genotypes. They can be judged realistic or not only if we specify the organism. They may be plausible values if the organism is a mammal or a bird because each offspring has a reasonable chance of survival, but implausible if the organism is an insect or an oyster because, in these organisms, most newborns are destined not to survive. Because selection depends on the relative magnitudes of the viabilities, it is usually most convenient to express the viabilities in relative terms. Taking genotype AA as the standard, the relative viabilities of AA, Aa, and aa are 0.75/0.75, 0.75/0.75, and 0.50/0.75, or 1.0, 1.0, and 0.67, respectively. Equivalently, we could choose genotype aa as the standard, in which case the relative viabilities are 0.75/0.50, 0.75/0.50, and 0.50/0.50, or 1.5, 1.5, and 1.0, respectively. Usually, the relative viabilities are calculated so that the largest relative viability equals 1.0. The relative viabilities are equal to the relative fitnesses of the genotypes provided that the genotypes are equally capable of reproduction. Viabilities expressed in relative terms are as valid for osprey as for oysters because the relative fitnesses are the same whether the absolute fitnesses are 0.75, 0.75, and 0.50 or 0.00075, 0.00075, and 0.00050.

Change in Allele Frequency in Diploids

If we write the allele frequencies of A and a as $p_t$ and $q_t$, respectively, in generation t, then it is straightforward to derive expressions for the allele frequencies in generation t in terms of the allele frequencies $p_{t-1}$ and $q_{t-1}$ in the previous generation.
The subscripts t and t - 1 are rather cumbersome to carry along in equations, so we will use the symbols p and q for $p_{t-1}$ and $q_{t-1}$, and the symbols p' and q' for $p_t$ and $q_t$. The relation between the allele frequencies in two consecutive generations is deduced in Table 6.2, where the fitnesses $w_{11}$, $w_{12}$, and $w_{22}$ are the relative viabilities.

TABLE 6.2 DIPLOID SELECTION FOR SURVIVORSHIP (VIABILITY)

                                 AA                        Aa                         aa                        Total
Generation t - 1
  Frequency before selection     $p^2$                     $2pq$                      $q^2$                     $1 = p^2 + 2pq + q^2$
  Relative fitness (viability)   $w_{11}$                  $w_{12}$                   $w_{22}$
  After selection                $p^2 w_{11}$              $2pq\,w_{12}$              $q^2 w_{22}$              $\bar{w} = p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22}$
  Normalized                     $p^2 w_{11}/\bar{w}$      $2pq\,w_{12}/\bar{w}$      $q^2 w_{22}/\bar{w}$      $1$
Generation t                     $p' = (p^2 w_{11} + pq\,w_{12})/\bar{w}$             $q' = (pq\,w_{12} + q^2 w_{22})/\bar{w}$

Note: The allele frequencies p and q are those in gametes immediately prior to fertilization. The AA, Aa, and aa zygotes survive to reproductive maturity in the ratio $w_{11} : w_{12} : w_{22}$. All genotypes, as adults, are assumed to have the same reproductive capacity.

In generation t - 1, the genotype frequencies of AA, Aa, and aa among newly fertilized eggs are given by $p^2$, $2pq$, and $q^2$, respectively, assuming random mating. By definition, newly fertilized eggs survive in the ratio $w_{11} : w_{12} : w_{22}$, and so the ratio of AA : Aa : aa among surviving adults is

$$p^2 w_{11} : 2pq\,w_{12} : q^2 w_{22}$$

To proceed, we need to convert the terms in the above expression into relative frequencies by dividing each term by the sum. The value of the sum is indicated in Table 6.2 as

$$\bar{w} = p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22} \qquad (6.8)$$

The symbol $\bar{w}$ is the average fitness in the population in generation t - 1. Division of each term in the ratio of survivors by $\bar{w}$ yields the genotype frequencies among adults:

$$AA: \frac{p^2 w_{11}}{\bar{w}} \qquad Aa: \frac{2pq\,w_{12}}{\bar{w}} \qquad aa: \frac{q^2 w_{22}}{\bar{w}} \qquad (6.9)$$

Among the surviving adults, the AA genotypes produce all A gametes, the Aa genotypes produce 1/2 A and 1/2 a gametes, and the aa genotypes produce all a gametes.
Hence, the frequencies of the gametes that unite at random to form the zygotes of the next generation are:

$$A: \; p' = \frac{p^2 w_{11} + pq\,w_{12}}{\bar{w}} \qquad a: \; q' = \frac{pq\,w_{12} + q^2 w_{22}}{\bar{w}} \qquad (6.10)$$

These are the relations we were after because they express the allele frequencies in any generation in terms of the allele frequencies in the previous generation. From these equations, the outcome of selection can be deduced.

As in the haploid model, it is often useful to know Δp, which is the difference in allele frequency p' - p resulting from one generation of selection. Subtraction of p from the expression for p' in Equation 6.10 and a little manipulation leads to:

$$\Delta p = \frac{pq\,[\,p(w_{11} - w_{12}) + q(w_{12} - w_{22})\,]}{\bar{w}} \qquad (6.11)$$

Equation 6.11 is the diploid analog of that in the haploid model in Equation 6.6.

Figure 6.2 Change in frequency of adult Drosophila melanogaster heterozygous for the dominant mutation Cy (Curly wings) in an experimental population, plotted against time (t, in generations). The genotype Cy/Cy is lethal. The curve represents the theoretical change in frequency when the ratio of viabilities of Cy/+ to +/+ is 0.5 : 1. (Data from Teissier 1942. The fitness value of 0.5 was estimated by Wright 1977.)

At this point, an example of the use of these equations is in order. We will use data on the change in the frequency of the Cy (Curly wings) allele in a laboratory population of Drosophila melanogaster, which are plotted in Figure 6.2. The Cy allele is lethal when homozygous, so $w_{11} = 0$. The points in Figure 6.2 pertain to the frequency of Cy heterozygotes but, because Cy/Cy genotypes do not survive, the allele frequency p of Cy equals one-half the frequency of Cy/+ adults. The points in the figure are each separated by one generation, and the initial generation has a frequency of Cy/+ adults of 0.67, hence p = 0.335 and thus q = 0.665. Wright (1977) has studied these data and concluded that $w_{12} = 0.5$ for Cy/+ genotypes, relative to a value of $w_{22} = 1$ for +/+ genotypes.
Substituting these values for $p$, $q$, $w_{11}$, $w_{12}$, and $w_{22}$ into the expression for $p'$ in Equation 6.10 yields

$$p' = \frac{0.335^2 \times 0 + 0.335 \times 0.665 \times 0.5}{0.335^2 \times 0 + 2 \times 0.335 \times 0.665 \times 0.5 + 0.665^2 \times 1} = 0.168$$

Therefore, the predicted frequency of Cy/+ adults in generation 1 is $2p' = 0.336$, which is reasonably close to the observed value of 0.368.

PROBLEM 6.2 Assume a value of $p = 0.168$ for the frequency of the Cy allele in generation 1 in the population in Figure 6.2. Calculate the expected frequency of Cy/+ heterozygotes among adults in generation 2.

ANSWER In this case, $p' = [0.168^2 \times 0 + 0.168 \times 0.832 \times 0.5]/\bar{w}$, where $\bar{w} = 0.168^2 \times 0 + 2 \times 0.168 \times 0.832 \times 0.5 + 0.832^2 \times 1.0 = 0.832$; hence $p' = 0.0699/0.832 = 0.084$. The expected frequency of Cy/+ adults is $2p' = 0.168$. This result is very close to the observed value of 0.165.

The theoretical curve in Figure 6.2 was calculated using the same generation-by-generation algorithm.

We make a slight digression to point out that it is sometimes convenient to think in terms of the marginal fitnesses of the A and a alleles. The marginal fitness equals the average fitness of all genotypes containing A or a, respectively, weighted by their relative frequency and the number of A or a alleles they contain. For example, A alleles are found in AA and Aa genotypes in the proportions $p$ and $q$ and, therefore, the marginal fitness $\bar{w}_1$ of A-containing genotypes equals $p\,w_{11} + q\,w_{12}$. Similarly, the marginal fitness of a-containing genotypes is $\bar{w}_2 = p\,w_{12} + q\,w_{22}$. The expression for $p'$ in Equation 6.10 thus becomes $p' = p\,\bar{w}_1/\bar{w}$, and Equation 6.11 becomes $\Delta p = p(\bar{w}_1 - \bar{w})/\bar{w}$. This expression makes it clear that any allele increases in frequency if the marginal fitness of genotypes containing the allele ($\bar{w}_1$) is greater than the average fitness in the population ($\bar{w}$).
This approach also generalizes readily to multiple alleles: for an allele with frequency $p_i$ and marginal fitness $\bar{w}_i$, the change in frequency in one generation equals

$$\Delta p_i = \frac{p_i(\bar{w}_i - \bar{w})}{\bar{w}} \tag{6.12}$$

Time Required for a Given Change in Allele Frequency

Having derived Equation 6.11 for $\Delta p$ resulting from one generation of selection, it is an appropriate next step to express $p_t$ in terms of $p_0$, as we did in Chapter 5 for the analogous equations involving mutation and migration. For any specified values of the initial allele frequencies and the fitness parameters, the allele frequencies can be determined generation after generation by computer iteration, as in Problem 6.2. More generally, one might want an explicit mathematical formula for $p_t$ in terms of $p_0$, but Equation 6.11 does not lend itself to analytical solution.

There is an alternative approach based on a continuous model, however. If the fitnesses are expressed as Malthusian fitnesses rather than as Darwinian fitnesses, then the analog of Equation 6.11 for a continuously growing population is

$$\frac{dp}{dt} = pq\,[\,p(m_{11} - m_{12}) + q(m_{12} - m_{22})\,] \tag{6.13}$$

where the values of $m$ are the Malthusian fitnesses. Note that there is no denominator in Equation 6.13 because it disappeared in the same way as the denominator in Equation 6.7. A less elegant way to derive an equation like Equation 6.13 is to suppose that the Darwinian fitnesses are all quite close to 1; then the change in allele frequency is slow enough that $\Delta p \approx dp/dt$ and, furthermore, $\bar{w} \approx 1$. Under these conditions, Equation 6.11 takes the form of Equation 6.13 with the $m$ values replaced with $w$ values.

To solve Equation 6.13, the terms are rearranged to isolate those in $p$ on one side and those in $t$ on the other; then one side is integrated over $p$ from $p_0$ to $p_t$, and the other is integrated over $t$ from 0 to $t$. The details are left as an exercise.
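The computer iteration mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function names `next_p` and `iterate` are hypothetical, and the update assumed is the viability-selection recursion $p' = (p^2 w_{11} + pq\,w_{12})/\bar{w}$ of Equation 6.10. The starting values are those quoted for the Cy example ($p = 0.335$, with fitnesses 0, 0.5, and 1).

```python
def next_p(p, w11, w12, w22):
    """One generation of viability selection (assumed form of Equation 6.10)."""
    q = 1.0 - p
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22  # average fitness
    return (p * p * w11 + p * q * w12) / w_bar


def iterate(p0, w11, w12, w22, generations):
    """Return the list [p_0, p_1, ..., p_generations]."""
    trajectory = [p0]
    for _ in range(generations):
        trajectory.append(next_p(trajectory[-1], w11, w12, w22))
    return trajectory


# Cy example: Cy/Cy is lethal (w11 = 0), Cy/+ has fitness 0.5, +/+ has fitness 1.
cy = iterate(0.335, 0.0, 0.5, 1.0, 5)
for t, p in enumerate(cy):
    # The frequency of Cy/+ adults is twice the allele frequency of Cy.
    print(t, round(p, 4), round(2 * p, 4))
```

With these particular fitnesses the average fitness equals $q$, so the recursion reduces to $p' = p/2$ and the allele frequency is halved every generation. Other fitness sets can be explored by changing the three fitness arguments.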
The answers are most easily presented if we change the symbols. For this purpose, we rewrite the fitnesses of the genotypes as follows:

$$w_{11} = 1 \qquad w_{12} = 1 - hs \qquad w_{22} = 1 - s$$
$$m_{11} = 0 \qquad m_{12} = -hs \qquad m_{22} = -s$$

where the Malthusian fitnesses follow from the approximation $m_{ij} \approx w_{ij} - 1$ when $w_{ij} \approx 1$. Use of the $h$ and $s$ symbols for the fitnesses has the advantage of making the amount of selection and the degree of dominance explicit. If $s$ is positive and $h$ is not negative, selection favors genotypes carrying the A allele. In this context, $s$ is called the selection coefficient against the aa genotype, and $h$ is called the degree of dominance of the a allele. For example, when $h = 0$, the Darwinian fitnesses of AA, Aa, and aa are 1, 1, and $1 - s$, respectively, and a is completely recessive to A. Alternatively, when $h = 1$, the Darwinian fitnesses are 1, $1 - s$, and $1 - s$, respectively, and a is completely dominant to A. In terms of the selection coefficient and the degree of dominance, $dp/dt$ of Equation 6.13 becomes

$$\frac{dp}{dt} = pqs\,[\,ph + q(1 - h)\,] \tag{6.14}$$

The following equations give $p_t$ in terms of $p_0$ in three cases of importance.

• A is a favored dominant. In this case $h = 0$. Then $dp/dt = pq^2 s$, and

$$\ln\frac{p_t}{q_t} + \frac{1}{q_t} = \ln\frac{p_0}{q_0} + \frac{1}{q_0} + st \tag{6.15}$$

• A is favored and the alleles are additive in their effects on fitness. Additive effects on fitness means that the fitness of the heterozygote is exactly intermediate between the fitnesses of the homozygotes, and so $h = 1/2$. The additive case is also referred to as semidominance or as genic selection. When $h = 1/2$, then $dp/dt = pqs/2$, and

$$\ln\frac{p_t}{q_t} = \ln\frac{p_0}{q_0} + \frac{s}{2}\,t \tag{6.16}$$

Note that Equation 6.16 for additive alleles is similar in form to Equation 6.3 for haploid selection when $w = 1 + s/2$ and $s$ is small. In other words, slow selection of additive alleles in a diploid species is mathematically almost equivalent to selection in a haploid species. In Problem 6.3, you will see that the precise requirement is $w_{12} = \sqrt{w_{11} w_{22}}$.

• A is a favored recessive. In this case, $h = 1$, so $dp/dt = p^2 qs$, and

$$\ln\frac{p_t}{q_t} - \frac{1}{p_t} = \ln\frac{p_0}{q_0} - \frac{1}{p_0} + st \tag{6.17}$$

Some of the practical implications of these equations are explored below. Problem 6.3 explores a little more deeply the relation between selection in haploid species and selection in diploid species. Figure 6.3 illustrates the changes in allele frequency for Equations 6.15 through 6.17.

Figure 6.3 The change in frequency $p$ of a favorable allele that is either dominant, additive, or recessive in its effect on fitness, plotted against the number of generations. The frequency of a favored dominant allele changes most slowly when the allele is common, and the frequency of a favored recessive allele changes most slowly when the allele is rare. In all three examples, the difference in relative fitness between the homozygous AA and aa genotypes is assumed to be five percent.

PROBLEM 6.3 The discrete model of selection in a haploid species is completely equivalent to that in a diploid species if, in the diploid, the Darwinian fitness of the heterozygote equals the geometric mean of the Darwinian fitnesses of the homozygotes, that is, if $w_{12} = \sqrt{w_{11} w_{22}}$. Show that, in this case, Equation 6.3 for $\Delta p$ in a haploid species is, indeed, identical to Equation 6.11 for $\Delta p$ in a diploid species. What is the equivalent value of $w$ in the haploid in terms of the Darwinian fitnesses in the diploid?

ANSWER Substitute $w_{12} = \sqrt{w_{11}}\sqrt{w_{22}}$ into Equation 6.11. The numerator simplifies to $pq \times (\sqrt{w_{11}} - \sqrt{w_{22}}) \times (p\sqrt{w_{11}} + q\sqrt{w_{22}})$. The denominator simplifies to $(p\sqrt{w_{11}} + q\sqrt{w_{22}})^2$. Therefore,

$$\Delta p = \frac{pq\,(\sqrt{w_{11}} - \sqrt{w_{22}})}{p\sqrt{w_{11}} + q\sqrt{w_{22}}} = \frac{pq\,(\sqrt{w_{11}/w_{22}} - 1)}{p\sqrt{w_{11}/w_{22}} + q}$$

This is in exactly the same form as Equation 6.6 with $w = \sqrt{w_{11}/w_{22}}$. Taking $w_{22} = 1$ as the standard, $w$ in the haploid model equals the fitness of the heterozygote in the diploid model. More specifically, let $w_{11} = (1 + s/2)^2$, $w_{12} = 1 + s/2$, and $w_{22} = 1$.
If $s$ is small compared to 1, then $w_{11} = (1 + s/2)^2 \approx 1 + s$, which implies that the Darwinian fitnesses are approximately additive. Furthermore, $\Delta p \approx pqs/2$, which has the same form as $dp/dt$ in the additive case leading to Equation 6.16.

PROBLEM 6.4 A certain highly isolated colony of the moth Panaxia dominula near Oxford, England, was intensively studied by Ford and collaborators over the period 1928 to 1968 (Ford and Sheppard 1969). This colony contained a mutant allele affecting color pattern. The frequency of the mutant allele declined steadily over the period 1939 to 1968. Indeed, the accompanying steady increase in the frequency of the normal allele followed Equation 6.16 for additive genes with $s = 0.20$ (Wright, 1978, shows a graph). The species has one generation per year, and the estimated frequency of the mutant allele in 1965 was 0.008. (This value is actually the average for the seven-year period 1962 to 1968.) Estimate the frequency of the mutant allele in 1950 and in 1940.

ANSWER Here we are given $q_t$ and want to use Equation 6.16 to estimate $q_0$. Between 1950 and 1965, there were $t = 1965 - 1950 = 15$ generations. We are given $q_t = 0.008$, hence $p_t = 0.992$ and $\ln(0.992/0.008) = 4.820$. Thus, $4.820 = \ln(p_0/q_0) + (0.20/2) \times 15$, or $\ln(p_0/q_0) = 3.32$. Then $p_0/q_0 = 27.660$, or $p_0 = 0.965$ and $q_0 = 0.035$. For the year 1940, $t = 1965 - 1940 = 25$ generations, from which $p_0 = 0.911$ and $q_0 = 0.089$. (You may be interested to know that observations made at the time yielded estimates of $q_0 = 0.037$ in 1950 and $q_0 = 0.111$ in 1940.)

Application to the Evolution of Insecticide Resistance

Some of the most dramatic examples of evolution in action result from the natural selection for chemical pesticide resistance in natural populations of insects and other agricultural pests. In the 1940s, when chemical pesticides were first used on a large scale, an estimated 7% of the agricultural crops in the United States were lost to insects.
Initial successes in chemical pest management were followed by gradual loss of effectiveness. Today, more than 400 pest species have evolved significant resistance to one or more pesticides, and 13% of the agricultural crops in the United States are lost to insects (May 1985). In many cases, significant pesticide resistance has evolved in 5 to 50 generations irrespective of the insect species, geographical region, pesticide, frequency and method of use, and other seemingly important variables (May 1985).

Equations 6.15 through 6.17 help to understand this apparent paradox because many of the resistance phenotypes result from single mutant alleles. The resistance alleles are often partially or completely dominant, so Equations 6.15 and 6.16 are applicable. Prior to use of the pesticide, the allele frequency $p_0$ of the resistant mutant is generally close to 0. Use of the pesticide increases the allele frequency, sometimes by many orders of magnitude, but significant resistance is noticed in the pest population even before the allele frequency $p_t$ increases above a few percent. Thus, as rough approximations, we may assume that $q_0$ and $q_t$ are both close enough to 1 that $\ln(p_0/q_0) \approx \ln p_0$ and $\ln(p_t/q_t) \approx \ln p_t$. Using these approximations, Equation 6.16 (additive case) implies that $t \approx (2/s) \times \ln(p_t/p_0)$ and Equation 6.15 (dominant case) implies that $t \approx (1/s) \times \ln(p_t/p_0)$. In many instances, the ratio $p_t/p_0$ may range from $1 \times 10^2$ to perhaps $1 \times 10^7$, and $s$ may typically be 0.5 or greater. Over this wide range of parameter values, the time $t$ is effectively limited to a range of 5 to 50 generations for the appearance of a significant degree of pesticide resistance. Details in actual examples depend on such factors as effective population number and extent of genetic isolation between local populations. An example of the global spread of an insecticide-resistance allele is given in Chapter 8.
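The two rare-allele approximations can be compared numerically with the integrated equations. The sketch below is illustrative only. It assumes that Equation 6.15 has the form $\ln(p_t/q_t) + 1/q_t = \ln(p_0/q_0) + 1/q_0 + st$ and that Equation 6.16 has the form $\ln(p_t/q_t) = \ln(p_0/q_0) + (s/2)t$, which is what integrating $dp/dt = pq^2 s$ and $dp/dt = pqs/2$ gives. All function names are hypothetical, and the parameter values in the loop are arbitrary choices.

```python
import math


def logit(p):
    """ln(p/q) with q = 1 - p."""
    return math.log(p / (1.0 - p))


def t_dominant(p0, pt, s):
    """Generations from p0 to pt for a favored dominant (assumed Eq. 6.15)."""
    return (logit(pt) + 1.0 / (1.0 - pt) - logit(p0) - 1.0 / (1.0 - p0)) / s


def t_additive(p0, pt, s):
    """Generations from p0 to pt for additive alleles (assumed Eq. 6.16)."""
    return 2.0 * (logit(pt) - logit(p0)) / s


def t_dominant_approx(p0, pt, s):
    """Rare-allele approximation t = (1/s) ln(pt/p0)."""
    return math.log(pt / p0) / s


def t_additive_approx(p0, pt, s):
    """Rare-allele approximation t = (2/s) ln(pt/p0)."""
    return 2.0 * math.log(pt / p0) / s


# Arbitrary illustrative values: initial frequency 1e-5, final frequency 5%.
p0, pt = 1e-5, 0.05
for s in (0.2, 0.5, 1.0):
    print(s,
          round(t_dominant(p0, pt, s), 1), round(t_dominant_approx(p0, pt, s), 1),
          round(t_additive(p0, pt, s), 1), round(t_additive_approx(p0, pt, s), 1))
```

Because the time depends on the ratio $p_t/p_0$ only through its logarithm, changing that ratio by several orders of magnitude changes $t$ by only a modest factor.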
The evolution of resistance caused by multiple interacting alleles may be expected to take somewhat longer than single-gene resistance.

PROBLEM 6.5 In the discussion of the evolution of insecticide resistance, we used the approximation $t \approx (1/s) \times \ln(p_t/p_0)$ for the dominant case and $t \approx (2/s) \times \ln(p_t/p_0)$ for the semidominant case. Evaluate the adequacy of the approximations for the values in the accompanying table by comparing them with the more exact values calculated from Equations 6.15 and 6.16.

| Example no. | $p_0$ | $p_t$ | $s$ |
|---|---|---|---|
| 1 | $1 \times 10^{-4}$ | 0.01 | 0.50 |
| 2 | $1 \times 10^{-4}$ | 0.10 | 0.50 |
| 3 | $1 \times 10^{-4}$ | 0.50 | 0.50 |
| 4 | $1 \times 10^{-7}$ | 0.10 | 0.50 |
| 5 | $1 \times 10^{-4}$ | 0.10 | 0.20 |

ANSWER The approximations are quite acceptable for the examples. The more exact and approximate values are as follows:

| Example no. | Eqn. 6.15 | Approximation | Eqn. 6.16 | Approximation |
|---|---|---|---|---|
| 1 | 9.3 | 9.2 | 18.5 | 18.4 |
| 2 | 14.2 | 13.8 | 28.1 | 27.6 |
| 3 | 20.4 | 17.0 | 36.8 | 34.1 |
| 4 | 28.1 | 27.6 | 55.7 | 55.3 |
| 5 | 35.6 | 34.5 | 70.1 | 69.1 |

EQUILIBRIA WITH SELECTION

An equilibrium value of $p$ in a discrete model is any value for which $\Delta p = 0$. When the allele frequency is at an equilibrium in an infinite population, the allele frequency remains the same generation after generation. Because real populations are finite in size, an allele frequency is subject to chance fluctuations and so cannot usually remain exactly at an equilibrium value. For any equilibrium, therefore, it is important to consider how the allele frequency behaves when it is close, but not exactly equal, to the equilibrium value. Any equilibrium can be classified as one of several different types according to the behavior of the allele frequency when it is near the equilibrium:

• An equilibrium is said to be locally stable if the allele frequency, when it is already close to the equilibrium, moves progressively closer in subsequent generations. A locally stable equilibrium may also be globally stable. This term means that the allele frequency always moves toward the equilibrium regardless of where it starts, even if initially far away from the equilibrium. A polymorphism with a stable equilibrium is sometimes called a balanced polymorphism.

• An equilibrium is unstable if the allele frequency, initially close to the equilibrium, moves progressively farther away in subsequent generations.

• An equilibrium is called neutrally stable or semistable if the allele frequency has no tendency to change regardless of its initial value. In such a case, every allele frequency represents an equilibrium because $\Delta p = 0$ whatever the value of $p$. This type of equilibrium is exemplified by the Hardy-Weinberg principle in an infinite population (Chapter 3).

The concepts of stability can be applied to the case of selection governed by Equation 6.11 in which A is the favored allele. For A to be favored, we need $w_{11} \ge w_{12} \ge w_{22}$, and at least one of the inequalities must be strict. In such a case, there are only two equilibria, namely $p = 0$ and $p = 1$. Except for $p = 0$ and $p = 1$, when $\Delta p = 0$, it is always true that $\Delta p > 0$. Hence, if $p$ is close to 0, its value increases (moving it farther away from 0), and so the equilibrium at $p = 0$ is unstable. On the other hand, if $p$ is near 1, it moves still closer to 1 (because $\Delta p > 0$), and so the equilibrium at $p = 1$ is locally stable. In this example, $p$ eventually goes to 1 whatever its initial value (other than $p = 0$ itself), and so the equilibrium at $p = 1$ is globally stable also.

Overdominance

With two alleles of a gene in a diploid organism, there is the possibility that the heterozygous genotype has the highest fitness or that the heterozygous genotype has the lowest fitness. These cases illustrate equilibria in which the equilibrium value of $p$ is between 0 and 1.

Overdominance, also called heterozygote superiority, is the term applied when the heterozygote has a higher fitness than both homozygotes.
Symbolically, heterozygote superiority means that $w_{12} > w_{11}$ and simultaneously $w_{12} > w_{22}$. With overdominance, $p = 0$ and $p = 1$ are both equilibria because, according to Equation 6.11, $\Delta p = 0$ at these values. There is also a third equilibrium made possible by the fact that $p(w_{11} - w_{12}) + q(w_{12} - w_{22})$ can equal 0. The equilibrium frequency of A is conventionally denoted $\hat{p}$; hence the equilibrium allele frequency of a is $\hat{q} = 1 - \hat{p}$. The equilibrium can be found by solving $\hat{p}(w_{11} - w_{12}) + \hat{q}(w_{12} - w_{22}) = 0$, from which a little algebra gives

$$\hat{p} = \frac{w_{12} - w_{22}}{2w_{12} - w_{11} - w_{22}} \tag{6.18}$$

Equation 6.18 is often encountered in another form in which the fitnesses are all expressed relative to that of the heterozygote by setting $w_{11} = 1 - s$, $w_{12} = 1$, and $w_{22} = 1 - t$. (This formulation is proposed at the risk of some confusion because $t$ is now the selection coefficient against aa rather than the time in generations.) With these substitutions, Equation 6.18 becomes

$$\hat{p} = \frac{t}{s + t}$$

This relationship makes a lot of intuitive sense because it implies that greater selection against aa increases the equilibrium frequency $\hat{p}$ of A.

The overdominance equilibrium in Equation 6.18 is globally stable whereas those at $p = 0$ and $p = 1$ are unstable. The time course is indicated in Figure 6.4A, where the arrowheads show the direction of change in allele frequency. Figure 6.4B shows the change in $\bar{w}$ with overdominance. The average fitness in the population is maximized at the stable equilibrium.

Figure 6.4 Selection when there is overdominance. (A) Allele frequency against time ($t$, in generations). The allele frequencies converge to an equilibrium value irrespective of the initial frequency. In this example, $w_{11} = 0.9$, $w_{12} = 1$, and $w_{22} = 0.8$, and the equilibrium frequency of the A allele, $\hat{p}$, is 0.667. (B) Average fitness $\bar{w}$ against $p$ for the same example. Note that $\bar{w}$ is a maximum at equilibrium.
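The convergence to the overdominant equilibrium can be illustrated numerically. The following sketch uses the fitnesses quoted for Figure 6.4 (0.9, 1, and 0.8) and assumes the equilibrium formula $\hat{p} = (w_{12} - w_{22})/(2w_{12} - w_{11} - w_{22})$ of Equation 6.18 together with the recursion $p' = (p^2 w_{11} + pq\,w_{12})/\bar{w}$ of Equation 6.10. The function names, the three starting frequencies, and the 300-generation horizon are arbitrary choices made for the illustration.

```python
def p_hat(w11, w12, w22):
    """Interior equilibrium (assumed form of Equation 6.18)."""
    return (w12 - w22) / (2 * w12 - w11 - w22)


def mean_fitness(p, w11, w12, w22):
    """Average fitness in the population (Equation 6.8)."""
    q = 1.0 - p
    return p * p * w11 + 2 * p * q * w12 + q * q * w22


def next_p(p, w11, w12, w22):
    """One generation of selection (assumed form of Equation 6.10)."""
    q = 1.0 - p
    return (p * p * w11 + p * q * w12) / mean_fitness(p, w11, w12, w22)


w = (0.9, 1.0, 0.8)            # fitnesses of AA, Aa, aa used in Figure 6.4
equilibrium = p_hat(*w)

finals = []
for start in (0.05, 0.5, 0.95):
    p = start
    for _ in range(300):        # generous, arbitrary number of generations
        p = next_p(p, *w)
    finals.append(p)

print(round(equilibrium, 3), [round(p, 3) for p in finals])
print(round(mean_fitness(equilibrium, *w), 4))
```

All three starting points end at the same frequency, and evaluating `mean_fitness` on either side of `equilibrium` gives smaller values than at the equilibrium itself.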
Maximization of average fitness is a frequent outcome of selection in random-mating populations with constant fitnesses. There are, however, many exceptions when mating is nonrandom, when the fitnesses are not constant, or when there are interactions between alleles of different genes (Ewens 1979; Curtsinger 1984). Note particularly that $\bar{w}$ is the average fitness in the population, not the average fitness of the population. The relative survivorships $w_{11}$, $w_{12}$, and $w_{22}$ are relevant only to the differential mortality of the genotypes within a population at any given time. The average of the relative survivorships is the average "fitness" $\bar{w}$ in the population. However, $\bar{w}$ has no necessary relation to vernacular meanings of "fitness" such as competitive ability, population size, production of biomass, or evolutionary persistence (Haymer and Hartl 1982).

Although overdominance is one mechanism for the maintenance of polymorphism, [...] The classic case is sickle-cell anemia in human beings, which is prevalent in many populations at risk for the type of malaria caused by the mosquito-borne protozoan parasite Plasmodium falciparum (Figure 6.5). The anemia is caused by an allele S that codes for a variant form of the β chain of hemoglobin. In persons of genotype SS, many red blood cells assume a curved, elongated shape ("sickling") and are removed from circulation. The result is a severe anemia as well as pain and disability owing to the accumulation of defective cells in the capillaries, joints, spleen, and other organs. In the absence of intensive medical [...] at a relatively high frequency because persons of genotype AS, in which A is the nonmutant allele, have only a mild form of the anemia but are quite resistant to malaria, perhaps because red blood cells infested with the parasite undergo sickling and are removed from circulation. Homozygous AA people are not anemic but, on the other hand, are the most sensitive to severe malaria.
The result of the offsetting sickle-cell anemia and malaria resistance is that the heterozygotes have the highest fitness. In regions of Africa in which malaria is common, the viabilities of AA, AS, and SS genotypes have been estimated as $w_{11} = 0.9$, $w_{12} = 1$, and $w_{22} = 0.2$, respectively (Cavalli-Sforza and Bodmer 1971; Templeton 1982). Substitution into Equation 6.18 leads to a predicted equilibrium allele frequency for A of $\hat{p} = (1 - 0.2)/(2 - 0.9 - 0.2) = 0.89$. Consequently, that of S is 0.11. This value is reasonably close to the average allele frequency of 0.09 across West Africa, but there is considerable variation in allele frequency among local populations.

Figure 6.5 The [...] gray areas show the incidence of falciparum malaria in [...] the Middle East and southern Europe [...] before mosquito control programs were implemented. The light gray areas are regions with a high incidence of sickle-cell anemia. The extensive overlap in the distributions [...]

PROBLEM 6.6 Experimental populations of Drosophila pseudoobscura were periodically treated with weak doses of the insecticide DDT. One population was initially polymorphic for five different inversions of the third chromosome. After 13 generations, three of the inversions had essentially disappeared from the population. The two that remained were Standard (ST) and Arrowhead (AR). Changes in frequency of each inversion were monitored and, from the values for the [...] generations, the relative fitnesses of ST/ST, ST/AR, and AR/AR genotypes were estimated as 0.47, 1.0, and 0.62, respectively (DuMouchel and Anderson 1968). Because the inversions undergo almost no recombination, each type can be considered as an "allele." What equilibrium frequency of ST is predicted? What equilibrium value of $\bar{w}$ is predicted?

ANSWER From Equation 6.18, $\hat{p} = (1.0 - 0.62)/(2.0 - 0.47 - 0.62) = 0.42$. (The observed value after 13 generations was 0.43.)
The predicted equilibrium value of $\bar{w}$, from Equation 6.8, equals $0.42^2 \times 0.47 + 2 \times 0.42 \times 0.58 \times 1.0 + 0.58^2 \times 0.62 = 0.78$.

PROBLEM 6.7 Warfarin is a blood anticoagulant used for rat control in World War II and afterward. Initially highly successful, the effectiveness of the rodenticide gradually diminished owing to the evolution of resistance among some target populations. Among Norway rats in Great Britain, resistance results from an otherwise harmful mutation R in a gene in which the normal nonresistant allele may be denoted S. In the absence of warfarin, the relative fitnesses of SS, SR, and RR genotypes have been estimated as 1.00, 0.77, and 0.46, respectively. In the presence of warfarin, the relative fitnesses have been estimated as 0.68, 1.00, and 0.37, respectively (May 1985). The reduced fitness of the RR genotype appears to result from an excessive requirement for vitamin K. Calculate the equilibrium frequency $\hat{q}$ of R in the presence of warfarin. Noting that, in the absence of warfarin, R and S are very nearly additive in their effects on fitness, estimate the approximate number of generations required for the allele frequency of R to decrease from $\hat{q}$ to 0.01 in the absence of the poison.

ANSWER From Equation 6.18, the equilibrium frequency $\hat{p}$ of S equals $(1.00 - 0.37)/(2 - 0.68 - 0.37) = 0.66$, and so $\hat{q}$ of R is 0.34. Setting $q_0 = 0.34$ and $q_t = 0.01$ in Equation 6.16, with $s = 1.00 - 0.46 = 0.54$, yields $t = 14.6$ generations. (The approximation is very good even though $s$ is large; the exact value is 14 generations.)

Local Stability

Although the curves in Figure 6.4A indicate that the interior equilibrium is locally stable when there is overdominance, an alternative approach is also applicable to the analysis of local stability in models of much greater complexity. It is based on the expression for $\Delta p$ in Equation 6.11. To emphasize that $\Delta p$ is a function of $p$, we will write it as an explicit function, $\Delta(p)$.
The local stability of an equilibrium depends on the behavior of $\Delta(p)$ for a value of $p$ close to, but not equal to, the equilibrium, as illustrated in Figure 6.6. It is convenient to write $\Delta(p + \varepsilon)$ as the change in allele frequency when the starting point is a small deviation, $\varepsilon$, from any allele frequency $p$. The function $\Delta(p + \varepsilon)$ can be expanded term by term into an infinite sum

$$\Delta(p + \varepsilon) = \Delta(p) + \varepsilon\,\frac{d\Delta(p)}{dp} + \frac{\varepsilon^2}{2!}\,\frac{d^2\Delta(p)}{dp^2} + \frac{\varepsilon^3}{3!}\,\frac{d^3\Delta(p)}{dp^3} + \cdots$$

The mathematical basis of this type of expansion is beyond the scope of the book. If you are unfamiliar with it and want to look it up, you will find it under the heading of the Taylor series in most textbooks of calculus. It is named after the mathematician Brook Taylor (1685-1731).

Figure 6.6 The change in allele frequency $\Delta p$ plotted as a function of allele frequency $p$ for a case of overdominance in which $w_{11} = 0.6$, $w_{12} = 1$, and $w_{22} = 0.2$. Starting with an allele frequency $p_0$, smaller than the equilibrium value, the positive value of $\Delta p_0$ indicates that the allele frequency in the next generation, $p_1$, will be greater than $p_0$ because $p_1 = p_0 + \Delta p_0$. At an allele frequency of $p_1$, the value of $\Delta p_1$ is also positive, and so $p_2$ is greater than $p_1$ because $p_2 = p_1 + \Delta p_1$. The steady increase continues until the population arrives at the equilibrium point $\hat{p}$, marked as the stable equilibrium. The same logic shows that, starting with an initial allele frequency greater than $\hat{p}$, the allele frequency decreases in each succeeding generation and ultimately converges to the equilibrium from the other side.

The value of the Taylor series expansion is that, when $\varepsilon$ is sufficiently small, then all terms in $\varepsilon^2$ and higher can be ignored. Therefore, for any value of $p$, we can approximate $\Delta(p + \varepsilon)$ in terms of $\Delta(p)$ itself and its first derivative.
Furthermore, if $p$ is one of the equilibrium points, then $\Delta(p) = 0$ by definition, and so the sign of $\Delta(p + \varepsilon)$ depends on the sign of the first derivative of $\Delta(p)$ evaluated at the equilibrium in question. By definition, an equilibrium is locally stable if the allele frequency, starting at a point near the equilibrium, moves ever closer to the equilibrium. In symbols, this means that $\Delta(\hat{p} + \varepsilon) < 0$ if $\varepsilon > 0$ and $\Delta(\hat{p} + \varepsilon) > 0$ if $\varepsilon < 0$. Therefore, any equilibrium point, denoted generically as $\hat{p}$, is locally stable if, and only if,

$$\left.\frac{d\Delta(p)}{dp}\right|_{p = \hat{p}} < 0 \tag{6.19}$$

where the vertical line and $\hat{p}$ mean that the derivative should be evaluated at the equilibrium in question.

In practice, calculating the derivative of $\Delta(p)$ can be quite tedious without the use of computer software like Mathematica to do the algebraic manipulations. The result of differentiating Equation 6.11 is that

$$\frac{d\Delta(p)}{dp} = \frac{pq\,\tilde{w}}{\bar{w}} + \frac{(q - p)(p - \hat{p})\,\tilde{w}}{\bar{w}} - \frac{2pq\,(p - \hat{p})^2\,\tilde{w}^2}{\bar{w}^2}$$

where $\tilde{w} = w_{11} - 2w_{12} + w_{22}$ and $\hat{p}$ is the interior equilibrium of Equation 6.18. With overdominance, $\tilde{w} < 0$. Note that, when $d\Delta(p)/dp$ is evaluated at $p = 0$ or $p = 1$, both the first and last terms equal 0; when it is evaluated at $p = \hat{p}$, the second and last terms equal 0. The stability analysis proceeds as follows:

• At $p = 0$, sign $[d\Delta(p)/dp]$ = −sign $(\tilde{w}) > 0$;
• At $p = \hat{p}$, sign $[d\Delta(p)/dp]$ = sign $(\tilde{w}) < 0$;
• At $p = 1$, sign $[d\Delta(p)/dp]$ = −sign $(\tilde{w}) > 0$.

Therefore, as is already clear from Figure 6.4A, the equilibrium points at 0, $\hat{p}$, and 1 are unstable, locally stable, and unstable, respectively. This stability analysis is predicated on the assumption of heterozygote superiority, which implies that $\tilde{w} < 0$. Exactly the same equilibrium points are present when there is heterozygote inferiority, but then $\tilde{w} > 0$, which means that the stability property of each equilibrium point is reversed. This situation is discussed next.
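The sign test of Equation 6.19 can also be applied numerically, without doing the algebra. The sketch below is an illustration under stated assumptions rather than the procedure used in the text: it takes $\Delta(p) = pq[p(w_{11} - w_{12}) + q(w_{12} - w_{22})]/\bar{w}$ from Equation 6.11, estimates the derivative by a central difference (a substitute for the symbolic differentiation described above), and uses the overdominant fitnesses quoted for Figure 6.6. The function names and the step size `eps` are hypothetical choices.

```python
def delta_p(p, w11, w12, w22):
    """Change in allele frequency in one generation (assumed Equation 6.11)."""
    q = 1.0 - p
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
    return p * q * (p * (w11 - w12) + q * (w12 - w22)) / w_bar


def slope(p, w11, w12, w22, eps=1e-6):
    """Central-difference estimate of d(delta_p)/dp at p.

    At p = 0 and p = 1 the formula is evaluated a distance eps outside
    the interval [0, 1]; this is only a numerical device.
    """
    return (delta_p(p + eps, w11, w12, w22)
            - delta_p(p - eps, w11, w12, w22)) / (2 * eps)


w = (0.6, 1.0, 0.2)                               # fitnesses of AA, Aa, aa
p_eq = (w[1] - w[2]) / (2 * w[1] - w[0] - w[2])   # interior equilibrium

slopes = {p: slope(p, *w) for p in (0.0, p_eq, 1.0)}
for p, d in slopes.items():
    verdict = "locally stable" if d < 0 else "unstable"
    print(round(p, 3), round(d, 3), verdict)
```

Swapping in fitnesses with heterozygote inferiority, for example `w = (1.0, 0.8, 0.9)`, reverses the three signs, in line with the reversal of stability described above.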
Heterozygote Inferiority

Heterozygote inferiority means that the fitness of the heterozygous genotype is smaller than that of both homozygotes: $w_{12} < w_{11}$ and $w_{12} < w_{22}$. An interior equilibrium, given by Equation 6.18, exists in this case also. The analysis in the previous section indicates that this equilibrium is unstable, whereas the equilibria at $p = 0$ and $p = 1$ are both locally (but not globally) stable. An example of heterozygote inferiority is depicted in Figure 6.7A, where the arrows again denote the direction of change in allele frequency. If the initial allele frequency is exactly equal to the equilibrium value (in this example, $\hat{p} = 1/3$), then the allele frequency remains at that value. In all other cases, $p$ goes to 1 or 0 depending on whether the initial allele frequency was above or below the equilibrium value.

Figure 6.7 Selection when there is heterozygote inferiority. (A) Allele frequency against time ($t$, in generations). The allele frequency goes to 0 or 1 depending on the initial frequency. In this example, $w_{11} = 1$, $w_{12} = 0.8$, and $w_{22} = 0.9$, and there is an unstable equilibrium when the frequency of the A allele is $\hat{p} = 0.333$. An infinite population with $p = 1/3$ maintains this frequency, but any slight upward change in the frequency of A results in eventual fixation, and any slight downward change in the frequency of A results in ultimate loss. (B) Average fitness $\bar{w}$ against $p$ for the same example. The unstable equilibrium represents the minimum of $\bar{w}$.

Figure 6.7B shows the change in average fitness. The unstable equilibrium at $\hat{p} = 1/3$ is the minimum average fitness. The shape of the $\bar{w}$ curve has an important implication that carries over to more complex examples. Imagine a population with an allele frequency near 0, at which $\bar{w} = 0.9$. In terms of average fitness in the population, the population would be better off if the allele frequency were near 1, because then $\bar{w} = 1.0$.
However, as shown by the direction of the arrows, the population cannot evolve toward $p = 1$. It cannot get through the "valley" because $p = 0$ is a locally stable equilibrium. The population has no way to escape from the equilibrium even though, in doing so, it would eventually end up with a greater average fitness. This consideration would seem to limit the ability of natural selection to increase average fitness in such cases, but one way out of the impasse is suggested in the next section.

The Adaptive Topography and the Role of Random Genetic Drift

Any graph of $\bar{w}$ against allele frequency is called an adaptive topography. The simplest example is Figure 6.7B. In order to generalize the example, try to imagine an adaptive topography in many dimensions with $\bar{w}$ a function of the allele frequencies at many loci. In many dimensions, the adaptive topography is a complex surface upon which there may be "peaks" and "pits" and even "saddle-shaped" regions. The peaks represent locally stable equilibria. Even if natural selection changes the allele frequencies so as to move $\bar{w}$ to the top of some peak, the peak it perches on may not be the highest peak that exists on the whole surface. However, as illustrated in Figure 6.7B, the population may become stuck there because the peak is a locally stable equilibrium.

By what process can a population stranded on a submaximal fitness peak get off the peak? To do so, it has to travel through a nearby valley to a place where natural selection can carry it to the top of an even higher fitness peak. This is something that natural selection acting alone cannot accomplish because it entails a temporary reduction in fitness. There is, however, a process that can accomplish the task: random genetic drift. In a sufficiently small population, the allele frequencies can change by chance, even producing a reduction in average fitness.
Theoretically, random genetic drift can shift a population from a locally stable equilibrium, through a nearby valley, and into a region where it is attracted by another locally stable equilibrium toward a higher fitness peak. Random genetic drift can therefore play a crucial role in evolution by allowing a population to explore the full range of its adaptive topography. This role of random genetic drift has been particularly emphasized by Wright (1977 and earlier) in his proposed shifting balance theory of evolution. Additional discussion of the theory is found in this chapter's section on interdemic selection; see also Hartl (1979), Provine (1986), and Coyne et al. (1997).

MUTATION-SELECTION BALANCE

You may recall from Chapter 4 that outcrossing species typically contain a large amount of hidden genetic variability in the form of recessive, or nearly recessive, harmful alleles, each present at a low frequency. Now we can explain why harmful alleles are not completely eliminated. Selection cannot eliminate them because they are continually created anew through recurrent mutation.

To be specific, suppose that a is a harmful allele of the wildtype A and that mutation of A to a takes place at the rate $\mu$ per generation. Because the allele frequency of a, which we call $q$, remains small, reverse mutation of a to A can safely be ignored. The calculation of $p'$ carried out to obtain Equation 6.10 is still valid, except that a proportion $\mu$ of A alleles mutate to a in each generation. Therefore,

$$p' = \frac{p\,(p\,w_{11} + q\,w_{12})}{\bar{w}}\,(1 - \mu) \tag{6.20}$$

To proceed further, it is convenient to write the relative fitnesses as

$$w_{11} = 1 \qquad w_{12} = 1 - hs \qquad w_{22} = 1 - s$$

The value of $s$ is the selection coefficient against the homozygous aa genotypes and $h$ is the degree of dominance of the a allele. If $h = 0$, then a is a complete recessive because AA and Aa have an identical fitness. If $h = 1$, then a is dominant because Aa and aa have an identical fitness.
Semidominance means that h = 1/2. In mutation-selection balance, we are concerned with harmful alleles that are near the recessive end of the spectrum, and so h will usually be substantially smaller than 0.5.

Equilibrium Allele Frequencies

When selection is balanced by recurrent mutation, there is a globally stable equilibrium at an allele frequency of p̂, which is the value of p in Equation 6.20 for which p′ = p. The equilibrium frequency of the harmful a allele is therefore q̂ = 1 − p̂. There are two important cases.

• When the harmful allele is a complete recessive (h = 0), then

$$\hat{q} = \sqrt{\frac{\mu}{s}} \tag{6.21}$$

• When the harmful allele shows partial dominance (h > 0), then, to an excellent approximation for realistic values of μ, h, and s,

$$\hat{q} \approx \frac{\mu}{hs} \tag{6.22}$$

Use of these equations is exemplified by Huntington disease in human beings. This severe inherited disorder is characterized by a degeneration of the neuromuscular system that typically appears after age 35. Although the disease itself results from a dominant mutation, the effects on fitness show only partial dominance owing to the late age of onset of the disease. Relative to a value of w₁₁ = 1 for the homozygous nonmutant genotype, the fitness of the heterozygous genotype has been estimated as w₁₂ = 0.81 (Reed and Neel 1959). Homozygous mutant genotypes also have the disease, but they are so rare that the equilibrium frequency of the mutant allele is determined by the fitness of the heterozygote. Equation 6.22 with hs = 0.19 is appropriate in this example. If we knew either μ or q̂, we could estimate the other. In a Michigan population, q = 5 × 10⁻⁵ for the Huntington allele (Reed and Neel 1959). Assuming that the population is in equilibrium, we can estimate μ from Equation 6.22 as μ = 5 × 10⁻⁵ × 0.19 = 9.5 × 10⁻⁶. This use of Equation 6.22 illustrates one of the common indirect methods for the estimation of mutation rates in human beings.
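The Huntington calculation can be reproduced in a few lines. The following is a minimal sketch, assuming only Equations 6.21 and 6.22 and the values quoted from Reed and Neel (1959); the function names are illustrative and do not come from the text.

```python
import math


def q_hat_recessive(mu, s):
    """Equation 6.21: equilibrium frequency of a completely recessive harmful allele."""
    return math.sqrt(mu / s)


def q_hat_partial_dominance(mu, h, s):
    """Equation 6.22: approximate equilibrium frequency when h > 0."""
    return mu / (h * s)


def mutation_rate_from_q(q_hat, hs):
    """Equation 6.22 rearranged to estimate mu from an observed allele frequency."""
    return q_hat * hs


# Values quoted in the text for the Huntington allele
q_huntington = 5e-5
hs_huntington = 1 - 0.81  # heterozygote fitness 0.81, so hs = 0.19

mu_estimate = mutation_rate_from_q(q_huntington, hs_huntington)
print(f"estimated mutation rate: {mu_estimate:.2e}")
```

The estimate is only as good as the assumption that the population is at equilibrium, which the code does not check.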
The degree of dominance of a harmful allele is a primary factor in determining its equilibrium frequency. Harmful alleles held in mutation-selection balance are rare. Thus the great majority of harmful alleles are present in heterozygous genotypes. Because there are so many heterozygous genotypes, relative to homozygous mutant genotypes, even a small reduction in fitness in the heterozygote has a large effect in decreasing the equilibrium allele frequency. This effect is shown quantitatively in Figure 6.8, which depicts q̂ as a function of μ/s and h. Note how the surface bends sharply upward at the far-right corner where h = 0. The increase indicates that, for a given value of μ/s, a completely recessive allele is maintained at a higher equilibrium frequency than a partially dominant allele. Furthermore, the surface drops sharply as h increases from 0, which means that even a small degree of dominance can cause a large reduction in equilibrium frequency. In general, for realistic values of μ, s, and h, the value of q̂ is typically less than 0.01. Therefore, although mutation-selection balance can account for low-frequency deleterious alleles, it cannot readily account for a harmful allele with a frequency greater than 0.01.

Figure 6.8 Allele frequencies maintained at equilibrium by mutation-selection balance. At each point on the surface, the height is the equilibrium frequency q̂ of a harmful allele, given as a function of the mutation rate μ (expressed in multiples of the selection coefficient s) and the degree of dominance h. Note that the surface bends sharply upward toward h = 0, a characteristic that means that even a small degree of dominance results in a substantial decrease in the equilibrium frequency of the harmful allele. The μ/s axis is easiest to interpret when the harmful allele is a lethal (s = 1).
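The sensitivity of the equilibrium to h can also be checked by iterating the recursion directly rather than relying on the closed-form approximations. The sketch below assumes the mutation-selection recursion p′ = (p²w₁₁ + pqw₁₂)(1 − μ)/w̄ with w₁₁ = 1, w₁₂ = 1 − hs, and w₂₂ = 1 − s; the starting frequency and the number of generations are arbitrary choices made for illustration.

```python
def next_p(p, mu, h, s):
    """One generation of selection followed by mutation of A to a at rate mu."""
    q = 1.0 - p
    w11, w12, w22 = 1.0, 1.0 - h * s, 1.0 - s
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
    return (p * p * w11 + p * q * w12) * (1.0 - mu) / w_bar


def equilibrium_q(mu, h, s, q0=0.01, generations=50_000):
    """Iterate the recursion long enough for q to settle near its equilibrium."""
    p = 1.0 - q0
    for _ in range(generations):
        p = next_p(p, mu, h, s)
    return 1.0 - p


mu, s = 5e-6, 1.0  # a recessive lethal with an illustrative mutation rate
q_recessive = equilibrium_q(mu, h=0.0, s=s)
q_partial = equilibrium_q(mu, h=0.025, s=s)
print(f"h = 0:     q_hat = {q_recessive:.3e}")
print(f"h = 0.025: q_hat = {q_partial:.3e}")
print(f"ratio: {q_recessive / q_partial:.1f}")
```

Iteration is used here instead of solving p′ = p algebraically so that the same code works for any h; the cost is that convergence is slow when h = 0, which is why so many generations are run.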
PROBLEM 6.8 To confirm for yourself that a small amount of dominance can have a major effect in reducing the equilibrium frequency of a harmful allele, imagine an allele that is lethal when homozygous (s = 1) in a population of Drosophila. Suppose that the allele is maintained by mutation-selection balance with μ = 5 × 10⁻⁶. Calculate the equilibrium frequency of the allele for a complete recessive and for partial dominance when h = 0.025.

ANSWER For a complete recessive, q̂ = √(μ/s) = √(5 × 10⁻⁶) = 2.24 × 10⁻³. For partial dominance, q̂ = μ/hs = (5 × 10⁻⁶)/0.025 = 2.00 × 10⁻⁴. With partial dominance, the equilibrium allele frequency is reduced more than tenfold, and the frequency of homozygous recessive genotypes at equilibrium is reduced more than a hundredfold. It is of interest that h = 0.025 is near the average degree of dominance estimated for "recessive" lethals in Drosophila (Simmons and Crow 1977).

The Haldane-Muller Principle

The Haldane-Muller principle, named after the geneticists J. B. S. Haldane (1892-1964) and H. J. Muller (1890-1967), deals with the effect of mutation-selection balance on the average fitness of a population. Ignoring recurrent mutation, selection would be able to rid a population completely of a harmful allele. Then, q̂ = 0 and w̄ = 1. Because of recurrent mutation, the equilibrium frequency is greater than 0. When h = 0, the average fitness in the population at equilibrium equals 1 − q̂²s = 1 − (μ/s)s = 1 − μ. The reduction in average fitness due to mutation therefore equals 1 − (1 − μ) = μ, which is called the mutation load. When a is partially dominant, the mutation load is approximately 2μ because the average fitness at equilibrium is 1 − 2p̂q̂hs − q̂²s ≈ 1 − 2μ. This result is obtained by ignoring terms in q̂² because they are so small. With or without partial dominance, therefore, the effect of recurrent mutation in reducing the average fitness in the population is independent of how harmful the mutation is.
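A quick numerical check makes the independence from s concrete. The sketch below assumes the equilibrium frequencies q̂ = √(μ/s) for h = 0 and q̂ ≈ μ/(hs) for h > 0 and computes the load as 1 − w̄; the particular values of μ, h, and s are hypothetical.

```python
import math


def mean_fitness(q, h, s):
    """Average fitness with w11 = 1, w12 = 1 - hs, w22 = 1 - s."""
    p = 1.0 - q
    return p * p + 2 * p * q * (1 - h * s) + q * q * (1 - s)


def mutation_load(mu, h, s):
    """Reduction in average fitness at mutation-selection equilibrium."""
    q_hat = math.sqrt(mu / s) if h == 0 else mu / (h * s)
    return 1.0 - mean_fitness(q_hat, h, s)


mu = 1e-5  # hypothetical mutation rate
for s in (0.1, 0.5, 1.0):
    load_recessive = mutation_load(mu, 0.0, s)
    load_partial = mutation_load(mu, 0.05, s)
    print(f"s = {s}: load(h=0) = {load_recessive:.2e}, load(h=0.05) = {load_partial:.2e}")
```

Across a tenfold range of s, the recessive load stays at about μ and the partially dominant load at about 2μ; the small excess over 2μ comes from the q̂² terms that the approximation ignores.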
That the effect of recurrent mutation on average population fitness depends only on the mutation rate is the Haldane-Muller principle. The implication is that the harmful effect of an increase in the mutation rate is the same irrespective of whether the mutations produced are mildly detrimental or severely harmful. The effects of severe and mild mutations balance out because a more harmful mutation comes to a lower equilibrium frequency.

MORE COMPLEX TYPES OF SELECTION

Although the two-allele model of viability selection illustrates the possible outcomes of selection, it ignores many potential complications. For example, when the genotypes differ in fertility rather than survivorship, then the model of viability selection is inadequate except in special cases. Most mutations have pleiotropic effects; that is, they affect more than one phenotypic attribute of the organism. For example, a gene affecting embryonic growth rate may also affect age at first reproduction. When the pleiotropic effects act in opposing directions (for example, increasing viability but reducing fertility), the net effect on fitness may be quite small. As a result, mutations with offsetting effects on different components of fitness may remain segregating in a population for many generations.

Additional complications arise because fitness is determined by many genes that interact with each other. Simple models of selection are valid only when the alleles interact in such a way that their effects on fitness are additive or multiplicative across genes. Other complications result when the fitnesses of the genotypes are not constant but variable in time or space. In this section we briefly examine a sample of more complex models. Many of the models are of interest because they can maintain genetic polymorphisms. Although the list is extensive, it is by no means complete. You should not try to memorize all the different types of selection.
They are collected here only for ease of reference.

Frequency-Dependent Selection

Frequency-dependent selection takes place when fitness is a function of either allele frequencies or genotype frequencies. There is no restriction on the type of frequency dependence except that each Darwinian fitness must be nonnegative. A simple example that illustrates frequency dependence is one in which the fitness of each genotype decreases in proportion to its frequency with a constant of proportionality equal to c:

AA: w₁₁ = 1 − cp²
Aa: w₁₂ = 1 − 2cpq
aa: w₂₂ = 1 − cq²

In this example, Δp = cpq(q − p)(p² − pq + q²)/w̄, and so there are equilibria at p = 0, 1/2, and 1. (The factor p² − pq + q² does not have a root for p in the range [0, 1].) A curious feature of this type of frequency-dependent selection is that, at equilibrium, w₁₂ is smaller than either w₁₁ or w₂₂, so there is heterozygote inferiority; yet p̂ = 1/2 is a globally stable equilibrium and w̄ is a maximum at this equilibrium. The peculiarities of this example are illustrative of frequency-dependent selection in general. Because the fitnesses can be any functions of allele or genotype frequency, nearly anything can happen.

Density-Dependent Selection

Density-dependent selection means that the fitnesses are functions of the population size. Models of density-dependent selection must explicitly include population size and population growth. With logistic growth of two haploid genotypes whose numbers at time t are A(t) and B(t), Equation 1.11 in Chapter 1 becomes

$$\frac{dA(t)}{dt} = r_1 A(t)\left[1 - \frac{A(t) + B(t)}{K_1}\right] \qquad \frac{dB(t)}{dt} = r_2 B(t)\left[1 - \frac{A(t) + B(t)}{K_2}\right]$$

Each genotype has its own intrinsic rate of increase (r₁ or r₂) and its own carrying capacity (K₁ or K₂), but they affect each other's growth through the total population size A(t) + B(t). At any time, the outcome of selection depends on the total population size.
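The pair of logistic equations for A(t) and B(t) can be explored numerically. The sketch below is a simple Euler integration under assumed, purely hypothetical parameter values in which genotype A has the larger intrinsic rate of increase and genotype B has the larger carrying capacity; it is meant only to show how the favored genotype can change as the population fills up.

```python
def simulate_two_genotypes(r1, k1, r2, k2, a0=1.0, b0=1.0, dt=0.01, t_end=200.0):
    """Euler integration of the two coupled logistic growth equations.

    Returns a list of (A, B) pairs; entry i holds the numbers at time t = i.
    """
    steps_per_unit = round(1.0 / dt)
    a, b = a0, b0
    history = [(a, b)]
    for step in range(1, round(t_end / dt) + 1):
        n = a + b  # both genotypes feel the total population size
        da = r1 * a * (1.0 - n / k1) * dt
        db = r2 * b * (1.0 - n / k2) * dt
        a, b = a + da, b + db
        if step % steps_per_unit == 0:
            history.append((a, b))
    return history


# Hypothetical values: A grows faster (r), B tolerates crowding better (K)
history = simulate_two_genotypes(r1=1.0, k1=1000.0, r2=0.5, k2=1500.0)
for t in (0, 5, 10, 50, 200):
    a, b = history[t]
    print(f"t = {t:3d}: total = {a + b:8.1f}, fraction A = {a / (a + b):.4f}")
```

A fixed-step Euler scheme is the simplest choice and is adequate for a qualitative picture; a higher-order integrator would be preferable if accurate trajectories were needed.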
When the population size is much smaller than either K₁ or K₂, then the right-hand factor in each growth equation equals approximately 1, and so the selection is determined by the relative values of r₁ and r₂. When the population size becomes approximately equal to the smaller of K₁ or K₂, then the genotype with the smaller carrying capacity stops growing while the other continues, and so the selection is determined by the relative values of K₁ and K₂. Interesting events happen when the selection for r favors one genotype and the selection for K favors the other, especially in situations in which stochastic factors also affect population size or there is a time lag between population size and its effect on growth rate. For further information on these types of models, see Roughgarden (1979), May (1981), Bulmer (1994), and Cohen (1995).

Fecundity Selection

In fecundity selection, differences in fitness between the genotypes result from the differing abilities of mating pairs to produce offspring. Because both genotypes in a mating pair contribute to the total number of offspring, the number of fitness parameters potentially equals the number of distinct kinds of mating pairs. For two alleles of one gene, there are nine possible types of mating because reciprocal matings may differ in the expected number of offspring; for example, the expected number of offspring from the mating Aa ♀ × aa ♂ may differ from that from the mating Aa ♂ × aa ♀. The presence of so many fitness parameters complicates the mathematical analysis. An analysis of selection based on individual genotypes, analogous to viability differences, is not possible unless the overall fecundity of any mating pair can
When this strong simplification does not hold, models of selection with fertility differences become rather complex (Ewens 1979; Clark and Feldman 1986). Models in which differences in fecundity are combined with differences in survivorship can retain genetic polymorphisms even if there is directional selection in one or the other component of fitness.

Age-Structured Populations

Age-structured populations with overlapping generations present problems even more formidable than those caused by fecundity and survivorship differences in populations with discrete, nonoverlapping generations. In each short interval of time, a new cohort of newborns comes into existence and, as it ages, the fate of each organism in the cohort is governed by the functions l(x), which is the probability of survival from birth to age x, and b(x), which is the probability that an organism of age x (actually in the infinitesimal age interval x to x + dx) reproduces. If the functions l(x) and b(x) maintain the same form over time, then it can be shown that the population eventually reaches a stable age distribution in which the number of organisms in each age group increases or decreases at a constant rate. At the stable age distribution, the overall growth rate of the population is the value of m that satisfies the equation:

$$1 = \int_0^\infty e^{-mx}\, l(x)\, b(x)\, dx$$

(See Crow and Kimura, 1970, for a derivation.) For this value of m, dN/dt = mN, where N is the total population size. In an age-structured population, m corresponds to the intrinsic rate of increase denoted r in Equation 1.7 in Chapter 1.

So far so good, but genetics complicates this situation enormously. If the l(x) and b(x) functions differ for different genotypes, then the allele frequencies change through time. As the allele frequencies change, so does the age structure, and the genotype frequencies in each age class may be different.
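For a single genotype with fixed l(x) and b(x), the growth rate m defined by the integral equation can be found numerically. The sketch below is one way to do it, assuming a trapezoidal approximation of the integral and bisection on m; the survival and reproduction schedules are hypothetical and were chosen because the exact answer is known (with l(x) = e^(−dx) and constant b(x) = β, the equation gives m = β − d).

```python
import math


def lotka_integral(m, l, b, max_age, dx=0.01):
    """Trapezoidal approximation of the integral of exp(-m x) l(x) b(x) over [0, max_age]."""
    n = round(max_age / dx)
    total = 0.0
    for i in range(n + 1):
        x = i * dx
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(-m * x) * l(x) * b(x)
    return total * dx


def solve_growth_rate(l, b, max_age, lo=-1.0, hi=2.0, tol=1e-8):
    """Bisection on m, assuming the integral exceeds 1 at lo and is below 1 at hi."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lotka_integral(mid, l, b, max_age) > 1.0:
            lo = mid  # integral too large, so m must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical schedules: constant death rate 0.2 and constant reproduction rate 0.5
def survival(x):
    return math.exp(-0.2 * x)


def reproduction(x):
    return 0.5


m = solve_growth_rate(survival, reproduction, max_age=100.0)
print(f"growth rate m = {m:.4f}")
```

Bisection works because the integral decreases steadily as m increases; the maximum age of 100 is an assumption that truncates a tail too small to matter for these schedules.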
The result is that the age structure may not become stable until selection reaches some equilibrium (possibly fixation). The sorts of complexities that can arise have been examined by Charlesworth (1980).

Heterogeneous Environments and Clines

Heterogeneous environments refer to models in which the relative fitnesses change according to the environment. The environmental heterogeneity may be spatial or temporal or both. Selection of this type can maintain polymorphisms in the absence of overdominance. If each homozygous genotype is favored in a different subset of environments, then there can be marginal overdominance, in which the heterozygous genotype has the highest fitness when averaged across all the environments, even though it is not the most fit genotype in any particular environment.

In some cases, the relative fitnesses of the genotypes vary geographically across a more or less smooth environmental gradient, for example, according to latitude, altitude, aridity, or salinity. If sufficiently stable in time, a gradient of selection across a region can result in a gradient of allele frequency across the region. A geographical trend in an allele frequency is called a cline. An unusually extreme example of a cline is found in the hemoglobin-I¹ allele in the eelpout fish Zoarces viviparus, the allele frequency of which drops from a value of nearly 1 in the North Sea to a value of nearly 0 in the Baltic Sea (Christiansen and Frydenberg 1974). In human aboriginal populations, there is a cline of increasing frequency of the allele I^B in the ABO blood groups from Southwest to Northeast Europe.
Although clines can result from selection (for example, when one genotype is favored at one extreme of the environmental gradient but disfavored at the other extreme), clines can also result from other processes. Migration is one possibility: differences in allele frequency in local populations at the extremes of the range may result from chance processes (for example, different founding populations), and migration of organisms from the extremes into the intermediate zone produces the cline.

The strongest evidence that a cline results from selection is when a cline is reproduced in different locations along a similar environmental gradient. An example of parallel clines played out on a grand scale is found in the electrophoretic polymorphism of alcohol dehydrogenase (the Adh gene) in D. melanogaster. In Eastern North America, the frequency of the Adh^F allele increases as one goes north, whereas DNA polymorphisms flanking Adh show no such geographic trend (Berry and Kreitman 1993). The cline is shown in the upper part of Figure 6.9. The frequency of Adh^F is correlated with cooler temperatures and less rainfall in the more northern latitudes. In Australia, as shown in the lower part of Figure 6.9, the frequency of the Adh^F allele increases as one goes south (Oakeshott et al. 1982). This pattern is in apparent contradiction to that in Eastern North America but, because Australia is in the Southern Hemisphere, the clines are actually parallel. Both show an increase in the frequency of Adh^F as one proceeds from the equator toward the polar cap: the North Pole in the Northern Hemisphere and the South Pole in the Southern Hemisphere. On a much smaller geographical scale, in mountainous regions, the frequency of the Adh^F allele shows a clinal increase with altitude, which is again correlated with cooler temperature and less rainfall. Data from the Caucasus Mountains (Grossman et al.
1970) have been discussed in Problem 4.2; parallel clines have also been studied in the mountains of Mexico (Pipkin et al. 1976).

Figure 6.9 Parallel clines of the Adh^F (alcohol dehydrogenase fast) allele in Eastern North America (upper panel) and in Australia (lower panel). The allele frequency is given as arcsin(√p), where p is the allele frequency of Adh^F. The angular transformation stretches the scale near the extreme values of p: for values of p = 0.1, 0.5, and 0.9, the values of arcsin(√p) are 0.322, 0.785, and 1.249, respectively, where the angles are measured in radians. The angular transformation is often used for proportions because it separates the variance of an estimate from the estimate itself: for a binomial proportion p based on n observations, the variance of p is p(1 − p)/n, whereas the variance of arcsin(√p), with the angle expressed in radians, is approximately 1/(4n). (North American data from Berry and Kreitman 1993; Australian data from Oakeshott et al. 1982.)

Diversifying Selection

The term diversifying selection refers narrowly to selection that favors extreme phenotypes. In a normal distribution of phenotypes, for example, diversifying selection means that organisms in the tails of the distribution are favored relative to those in the middle. More generally, diversifying selection refers to any type of selection in which genotypes are favored merely because they are different. Genes under diversifying selection tend to maintain a relatively large number of alleles. Examples include genes of the major histocompatibility complex in mammals, in which the selective agent is thought to be through resistance to parasitic microorganisms (Satta et al. 1993), and bacterial genes that produce toxins (colicins) that kill other bacteria, in which the selective agent is the destruction of competitors (Riley 1993; Ayala et al. 1994).
Some plants have genes for gametophytic self-incompatibility, in which a pollen grain that carries any self-incompatibility allele is unable to pollinate a plant that carries the same allele. Self-incompatibility of this type implies that no plant can fertilize itself. Because a plant of genotype SᵢSⱼ can produce only Sᵢ and Sⱼ pollen, the pollen cannot fertilize SᵢSⱼ plants. Furthermore, homozygous genotypes are not normally found because their formation would require that Sᵢ pollen fertilize an SᵢSⱼ plant. It is easy to show that there is positive selection for new self-sterility alleles and that, at equilibrium, every allele has the same frequency. For n alleles, if Sᵢ has frequency pᵢ, then the frequency of SᵢSⱼ genotypes (all genotypes that carry Sᵢ) with random mating is 2pᵢ(1 − pᵢ)/(1 − Σpᵢ²). The denominator is necessary because of the absence of homozygous genotypes. The probability that an Sᵢ pollen can be successful in fertilization is therefore the probability of genotypes other than SᵢSⱼ, which equals 1 − 2pᵢ(1 − pᵢ)/(1 − Σpᵢ²). At equilibrium, we must have pᵢ(1 − pᵢ) = pⱼ(1 − pⱼ). From these expressions follow some important conclusions summarized in Problem 6.9. For more information on gametophytic self-incompatibility systems, see Ioerger et al. (1991) and Uyenoyama (1995).

PROBLEM 6.9 Show that pᵢ(1 − pᵢ) = pⱼ(1 − pⱼ) for all i and j implies that pᵢ = pⱼ = 1/n, where n is the number of self-incompatibility alleles and n ≥ 3. Use these equilibrium allele frequencies to show that the probability that a pollen grain lands on a compatible style equals (n − 2)/n. Finally, show that the probability of successful fertilization by a new mutant S allele, relative to that of any preexisting allele, equals n/(n − 2).

ANSWER pᵢ(1 − pᵢ) = pⱼ(1 − pⱼ) implies that pᵢ − pⱼ = pᵢ² − pⱼ² = (pᵢ − pⱼ)(pᵢ + pⱼ), so that either pᵢ = pⱼ for all i and j or pᵢ + pⱼ = 1. Because n ≥ 3, (pᵢ + pⱼ) ≠ 1. Because there are n alleles, we must have Σpᵢ = 1, and so pᵢ = 1/n.
The probability of a pollen grain landing on a compatible style is 1 − 2pᵢ(1 − pᵢ)/(1 − Σpᵢ²) = 1 − 2/n = (n − 2)/n. A pollen grain containing a newly arising S allele will always land on a compatible style, and so its probability of fertilization, relative to that of a preexisting allele, equals 1/[(n − 2)/n] = n/(n − 2). In effect, this is the relative fitness of a new mutation. For n = 3, 4, 5, 10, 50, and 100, it equals 3, 2, 1.67, 1.25, 1.04, and 1.02, respectively.

Differential Selection in the Sexes

Some genes may have different effects in the two sexes. If the fitnesses of genotypes differ between the sexes, then genotypes that are disfavored in one sex may be favored in the other. The offsetting effects increase the opportunity for a balanced polymorphism. The survivorship model of selection can be extended to include this case by supposing that the relative viabilities of the genotypes AA, Aa, and aa are given by w₁₁, w₁₂, and w₂₂ in females and by v₁₁, v₁₂, and v₂₂ in males. One of the w's and one of the v's can be set arbitrarily to 1, which leaves four fitness parameters rather than two. A more serious complication is that the allele frequencies in gametes are no longer the same in males and females. Letting p_f and p_m be the allele frequency of A in female and male gametes, respectively, then the genotype frequencies of AA, Aa, and aa in the zygotes are p_f p_m, p_f q_m + q_f p_m, and q_f q_m, respectively, where q_f = 1 − p_f and q_m = 1 − p_m. One of the consequences of differential selection in the sexes is that, with an appropriate choice of fitnesses, it is possible to have more than one stable polymorphic equilibrium. A stable equilibrium is also possible with heterozygote inferiority in one sex or with incomplete dominance when selection works in opposite directions in the two sexes.
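The two-sex model can be written out as a recursion on the gamete frequencies p_f and p_m. The sketch below assumes the zygote frequencies given above and ordinary viability selection applied separately in each sex; the viabilities used in the example are hypothetical, chosen to show incomplete dominance with selection in opposite directions in females and males.

```python
def next_gamete_frequencies(p_f, p_m, w, v):
    """One generation of viability selection that differs between the sexes.

    p_f, p_m: frequency of A in female and male gametes.
    w, v: (AA, Aa, aa) relative viabilities in females and in males.
    """
    q_f, q_m = 1.0 - p_f, 1.0 - p_m
    # Zygote genotype frequencies are the same in both sexes before selection
    freq_AA = p_f * p_m
    freq_Aa = p_f * q_m + q_f * p_m
    freq_aa = q_f * q_m

    def frequency_of_A_after_selection(fitness):
        f11, f12, f22 = fitness
        mean = freq_AA * f11 + freq_Aa * f12 + freq_aa * f22
        return (freq_AA * f11 + 0.5 * freq_Aa * f12) / mean

    return frequency_of_A_after_selection(w), frequency_of_A_after_selection(v)


def iterate(p_f, p_m, w, v, generations=1000):
    """Apply the recursion repeatedly and return the final gamete frequencies."""
    for _ in range(generations):
        p_f, p_m = next_gamete_frequencies(p_f, p_m, w, v)
    return p_f, p_m


# Hypothetical viabilities: A favored in females, a favored in males
females = (1.0, 0.8, 0.6)
males = (0.6, 0.8, 1.0)
p_f, p_m = iterate(0.1, 0.1, females, males)
print(f"p_f = {p_f:.4f}, p_m = {p_m:.4f}")
```

When the same viabilities are supplied for both sexes, the recursion reduces to the ordinary one-sex model after the first generation, which gives a convenient sanity check.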
X-linked Genes

Genes located in the X chromosome can have the same sort of complications as differential selection in the sexes, but the possibilities for polymorphism are not quite so numerous because there are only three fitness parameters instead of four. If A and a are alleles of an X-linked gene, then there are three genotypes in females (AA, Aa, and aa) and two genotypes in males (either A or a along with the Y chromosome). One fitness parameter in each sex can be set arbitrarily to 1. As with differential selection in the sexes, the allele frequencies differ in eggs and sperm. However, in any generation, the frequency of A in male zygotes equals the frequency of A in female gametes of the preceding generation. If you do not understand why, think about the parental origin of the X chromosome in a male.

Gametic Selection

Many plants go through a life cycle in which both haploid products of meiosis and the diploid products of fertilization are exposed to selection. In mosses and vascular plants, for example, a diploid organism (the sporophyte) produces spores each of which germinates to form a haploid organism (the gametophyte) that reproduces asexually by mitosis. The gametophytes give rise to haploid male and female gametes, which undergo fertilization creating a new diploid generation. In mosses, the prominent stage of the life cycle is the gametophyte whereas, in higher plants, the prominent stage is the sporophyte.

When the haploid phase of the life cycle is exposed to selection, the selection is called gametic selection. As a concrete model, suppose that the relative survivorships of A and a gametophytes (the haploid phase) are given by v₁ and v₂, respectively. In the sporophytes (the diploid phase), the survivorships can be written as before as w₁₁, w₁₂, and w₂₂.
If p and q are the allele frequencies of A and a at the beginning of the haploid phase, then after the differential haploid mortality has taken place, the frequencies will be p* = pv₁/v̄ and q* = qv₂/v̄, where v̄ = pv₁ + qv₂. With random fertilization among the gametes, the diploid genotypes AA, Aa, and aa are formed in the proportions p*², 2p*q*, and q*², and these survive in the relative proportions w₁₁, w₁₂, and w₂₂. You may verify for yourself that, at the beginning of the haploid phase of the next generation, the allele frequency of A is

$$p' = \frac{p^2 w_{11} v_1^2 + pq\,w_{12} v_1 v_2}{p^2 w_{11} v_1^2 + 2pq\,w_{12} v_1 v_2 + q^2 w_{22} v_2^2}$$

This equation has the same form as the equation for p′ in Equation 6.10 except that w₁₁ is replaced with w₁₁v₁², w₁₂ with w₁₂v₁v₂, and w₂₂ with w₂₂v₂². The conditions for fixation or for a stable or unstable equilibrium are therefore determined by the relative magnitude of the composite "fitness" of the heterozygous genotype relative to those of the homozygous genotypes.

Meiotic Drive

A situation analogous to, but distinct from, gametic selection takes place when there is non-Mendelian segregation in the heterozygous genotype. In females, unequal recovery of reciprocal products of meiosis can be caused by nonrandom segregation of homologous chromosomes to the functional egg nucleus, which is why non-Mendelian segregation is known generically as meiotic drive. In other cases, the unequal recovery is caused by a gene or genes that act to render gametes carrying the homologous chromosome nonfunctional. Examples include "sperm killers" such as segregation distortion in Drosophila melanogaster (Charlesworth and Hartl 1978) and the t alleles in the house mouse (Hammer and Silver 1993) as well as "spore killers" described in filamentous fungi (Raju 1994).

Because meiotic drive acts only in the heterozygous genotype, its effect is to alter the term pqw₁₂ in Equation 6.10 for p′.
This term comes from the expression 1/2 × 2pqw₁₂ for the proportion of A-bearing gametes from surviving Aa genotypes, and the 1/2 is the Mendelian segregation ratio. If the ratio of A : a gametes from Aa heterozygotes is k : 1 − k instead of 1/2 : 1/2, then the expression for p′ becomes

$$p' = \frac{p^2 w_{11} + 2k\,pq\,w_{12}}{\bar{w}} \tag{6.23}$$

where w̄ is the average survivorship in the population defined in Equation 6.8. Since A is the driven allele, k > 1/2. Equation 6.23 is illustrative of meiotic drive even though it requires that the non-Mendelian segregation affect both sexes equally, a case that is not generally found in practice. One implication of the equation is that, unless selection counteracts the meiotic drive, the driven allele goes to fixation. In particular, if the relative viabilities are equal, then p′ = p² + 2kpq and Δp = pq(2k − 1), so that p → 1 because k > 1/2.

In some examples of meiotic drive, including segregation distortion and the t alleles, the driven allele is lethal when homozygous (Hartl 1970). Assuming that the lethality is completely recessive, the survivorships are w₁₁ = 0, w₁₂ = 1, and w₂₂ = 1. Equation 6.23 implies that p′ = 2kp/(1 + p) and so Δp = p[(2k − 1) − p]/(1 + p). There is an interior equilibrium at p̂ = 2k − 1, which intuition suggests (correctly) is locally stable. It is also globally stable (Figure 6.10). Note that p̂ is between 0 and 1 for any value of k between 1/2 and 1. The calculations for a recessive-lethal driven allele are a special case of the slightly more general model discussed in Problem 6.10.

PROBLEM 6.10 Suppose that the AA genotype is not completely lethal but that its survivorship is given by 1 − s relative to a value of 1 for both Aa and aa genotypes. Show that Δp = pq[(2k − 1) − ps]/(1 − p²s). Find p̂ and define the conditions, in terms of k and s, for which p̂ is between 0 and 1. Show also that the equilibrium is locally stable.

ANSWER Equation 6.23 implies that p′ = [p²(1 − s) + 2kpq]/(1 − p²s).
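As a numerical companion to this problem, the recursion for p′ under meiotic drive can simply be iterated. The sketch below assumes the form p′ = (p²w₁₁ + 2kpqw₁₂)/w̄ with drive acting equally in both sexes; the values of k and s are hypothetical and are chosen to show one case with an interior equilibrium, one case in which the driven allele is fixed, and the recessive-lethal case.

```python
def next_p(p, k, w11, w12, w22):
    """One generation of viability selection with segregation ratio k in Aa heterozygotes."""
    q = 1.0 - p
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
    return (p * p * w11 + 2 * k * p * q * w12) / w_bar


def iterate(p0, k, w11, w12, w22, generations=2000):
    """Iterate the recursion from p0 and return the final frequency of A."""
    p = p0
    for _ in range(generations):
        p = next_p(p, k, w11, w12, w22)
    return p


# AA survivorship 1 - s with s = 0.5; moderate drive gives an interior equilibrium
p_interior = iterate(0.05, k=0.6, w11=0.5, w12=1.0, w22=1.0)
# Same selection but strong drive: the driven allele goes to fixation
p_fixed = iterate(0.05, k=0.9, w11=0.5, w12=1.0, w22=1.0)
# Recessive lethal driven allele
p_lethal = iterate(0.05, k=0.75, w11=0.0, w12=1.0, w22=1.0)

print(f"k = 0.60, s = 0.5: p = {p_interior:.4f}")
print(f"k = 0.90, s = 0.5: p = {p_fixed:.4f}")
print(f"k = 0.75, lethal:  p = {p_lethal:.4f}")
```

The first and third runs can be compared with (2k − 1)/s and 2k − 1, respectively; the second run has (2k − 1)/s greater than 1, so no interior equilibrium exists.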
Δp = p′ − p simplifies to the formula given. Setting Δp = 0 yields equilibria at 0, 1, and p̂ = (2k − 1)/s. For p̂ > 0, we need (2k − 1)/s > 0, or k > 1/2. For p̂ < 1, we need (2k − 1)/s < 1, or k < (s + 1)/2. Note that, as the selection against the A allele becomes smaller (s closer to 0), more values of k result in fixation of the unfavorable A allele and fewer result in an interior equilibrium. The stability of p̂ can be deduced by evaluating the derivative in Equation 6.19. For this purpose, it is convenient to write Δp as pqs(p̂ − p)/(1 − p²s). In taking the derivative, remember that any term containing p̂ − p becomes 0 when p = p̂, so these terms can be neglected. The derivative, evaluated at p̂, equals −p̂q̂s/(1 − p̂²s), where q̂ = 1 − p̂. The sign of this number must be negative, and so the equilibrium at p̂, when it exists, is locally stable.

Figure 6.10 The balance between meiotic drive and viability selection. (A) Viability selection only: by itself, viability selection would eliminate the a allele. (B) Meiotic drive only: the heterozygous genotype Aa produces 40% A-bearing gametes. (C) Δp versus p when both viability selection and meiotic drive are operating at the same time, using the same fitness and meiotic drive parameters as above. In this example, when both processes operate simultaneously, their offsetting effects create a stable polymorphism.

Multiple Alleles

The presence of multiple alleles complicates the analysis of selection because the number of fitness parameters increases. With n alleles, there are n(n + 1)/2 possible genotypes, each with its own fitness. Furthermore, simple generalizations from two-allele theory do not necessarily carry over to multiple alleles. Consider the example of heterozygote superiority.
Intuitively, one might expect that fitnesses yielding stable, multiple-allele polymorphisms would be easy to generate by requiring that each heterozygous genotype have a greater fitness than the homozygous genotypes formed from the constituent alleles. This is not the case, however. If, for n alleles, the fitnesses of the genotypes are assigned at random between 0 and 1, subject to the condition that, for each i and j, wᵢⱼ > max(wᵢᵢ, wⱼⱼ), then only a relatively small proportion of systems with four or more alleles yields a stable polymorphism with all alleles present. For four, five, and six alleles, the percentage of fitness sets yielding a stable equilibrium is 12.6, 1.2, and 0.03, respectively (Lewontin et al. 1978). The reason for the low percentages is that, even if a heterozygote is more fit than its constituent homozygotes, there might be a different homozygote more fit than all three.

All right, how about requiring that each heterozygote be better than every homozygote? Surprisingly, this requirement does not help matters much. In this case, for four, five, and six alleles, the percentage of fitness sets yielding a stable equilibrium is 34.3, 10.4, and 1.3, respectively (Lewontin et al. 1978). The point is that polymorphisms with greater than three or four alleles are extremely unlikely to be maintained by selection for simple heterozygous advantage with constant survivorship. If selection is implicated in such a case, models of selection such as diversifying selection or heterogeneous environments are much more plausible.

On the other hand, the fitnesses of genotypes in nature are not chosen simultaneously by a random number generator. Each new allele that arises is tested against the resident alleles, and the new allele is able to invade the population if its marginal fitness exceeds the mean fitness of the population.
By this process, multiple-allele polymorphisms can be accumulated, and the order in which the mutations appear makes a difference (Spencer and Marks 1988).

The possibility of multiple alleles also creates surprising situations in which the outcome of natural selection depends on the order in which the alleles are introduced into the population. Earlier in this chapter we mentioned the sickle-cell hemoglobin polymorphism in Africa and its relation to malaria resistance. People who are homozygous AA for the normal allele are susceptible to falciparum malaria, those who are heterozygous AS for the sickle-cell allele are resistant to malaria and have a mild anemia, and those who are homozygous SS for the sickle-cell allele have a life-threatening anemia. This is a classic case of heterozygote superiority. There is another allele, C, found at low frequency in populations in which the S allele is prevalent. The C allele is also protective against malaria, but the allele is recessive, and so only the CC genotypes are resistant. Unlike the S allele, the C allele does not cause anemia.

The relative survivorship of each of the various hemoglobin genotypes has been estimated based on studies of more than 32,000 people in 72 populations in West Africa (Cavalli-Sforza and Bodmer 1971). The survivorships are given in the following table, which indicates the genotypes that are resistant and those that have severe hemolytic anemia. The survivorships were estimated in a geographical region where malaria was common. Note that the S allele causes a severe anemia in the heterozygous SC genotype, but not so serious as that in the homozygous SS genotype.

Genotype        AA     AS          SS       AC     SC       CC
Survivorship    0.9    1.0         0.2      0.9    0.7      1.3
Health status          Resistant   Anemic          Anemic   Resistant

Inspection of these survivorships reveals a paradox. The CC genotype has the highest fitness, yet the C allele is not fixed.
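The survivorships in the table can be checked numerically. The Python sketch below is an illustration only and is not taken from the text: it assumes random mating and constant survivorships, and the helper names (`fitness`, `marginal`, `mean_fitness`, `delta`, `with_c`) are hypothetical. It finds the A/S equilibrium, the marginal fitness of a rare C allele at that equilibrium, and the starting frequency above which C increases when A and S are held in their equilibrium ratio.

```python
# Survivorships from the table in the text; genotypes are unordered allele pairs.
SURVIVORSHIP = {"AA": 0.9, "AS": 1.0, "SS": 0.2, "AC": 0.9, "SC": 0.7, "CC": 1.3}
W = {frozenset(genotype): w for genotype, w in SURVIVORSHIP.items()}


def fitness(a, b):
    """Survivorship of the genotype formed by alleles a and b."""
    return W[frozenset((a, b))]


def marginal(allele, freqs):
    """Marginal fitness of an allele: its average fitness over random partners."""
    return sum(f * fitness(allele, other) for other, f in freqs.items())


def mean_fitness(freqs):
    """Average fitness of the population under random mating."""
    return sum(f * marginal(a, freqs) for a, f in freqs.items())


def delta(allele, freqs):
    """One-generation change in the frequency of `allele`."""
    w_bar = mean_fitness(freqs)
    return freqs[allele] * (marginal(allele, freqs) - w_bar) / w_bar


# A/S equilibrium under overdominance: p_A = t / (s + t).
s = 1 - fitness("A", "A") / fitness("A", "S")
t = 1 - fitness("S", "S") / fitness("A", "S")
p_a = t / (s + t)
resident = {"A": p_a, "S": 1 - p_a}


def with_c(x):
    """C at frequency x, with A and S kept in their equilibrium ratio."""
    return {"A": (1 - x) * p_a, "S": (1 - x) * (1 - p_a), "C": x}


# Bisection for the frequency of C at which its change switches sign.
lo, hi = 1e-6, 0.5
for _ in range(60):
    mid = (lo + hi) / 2
    if delta("C", with_c(mid)) < 0:
        lo = mid
    else:
        hi = mid
threshold = (lo + hi) / 2

print(f"A : S ratio at equilibrium = {p_a / (1 - p_a):.2f} : 1")
print(f"mean fitness, A and S only = {mean_fitness(resident):.3f}")
print(f"marginal fitness of rare C = {marginal('C', resident):.3f}")
print(f"threshold frequency of C   = {threshold:.3f}")
```

Under these assumptions the script reports an A : S ratio of 8 : 1, a mean fitness of 0.911, a marginal fitness of 0.878 for rare C, and a threshold near 0.073, the same figures the text uses to explain the paradox.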
The reason is found in the historical order in which the S and C mutations took place. The A allele is the ancestral type and undoubtedly predated the human settlement of regions subject to malaria. In such a region, the appearance of an S allele creates a heterozygous advantage, and natural selection quickly attains a stable equilibrium at which the ratio of A : S alleles is approximately 8 : 1. At this equilibrium, the average fitness in the population is $\bar{w} = 0.911$. Now suppose that mutation or migration were to introduce a small number of C alleles. Because C alleles are rare, each is present in either the AC genotype, with probability 8/9, or in the SC genotype, with probability 1/9. The average fitness of genotypes heterozygous for C is therefore 0.878, which is smaller than the average fitness in the population. Hence, the frequency of C decreases, and C goes extinct. The C allele has no chance of invading an A/S polymorphism unless the initial frequency of C is sufficiently large. Figure 6.11 illustrates this phenomenon. With the survivorships given in this example, the critical initial frequency of C that allows invasion is 0.073. Once C can get established in the population, it eventually becomes fixed.

[Figure 6.11 Change in frequency of the hemoglobin C allele (plotted against the frequency of the C allele) in a population in which the A and S alleles are present in their equilibrium proportions of 8 : 1. When the initial frequency of C is small, the change in frequency is negative, and so C is eliminated even though CC genotypes have the highest fitness. The C allele is unable to invade unless its initial frequency is greater than 0.073, and in that case C goes to fixation. The plot is based on the survivorship values given in the text.]

Multiple Loci and Gene Interaction: Epistasis

With multiple loci, as many types of gametes are possible as there are combinations of alleles.
The simplest example is the two-locus, two-allele case, in which the possible gametes are AB, Ab, aB, and ab. In the absence of recombination (r = 0), each type of gamete can be regarded as an "allele" of one locus with four alleles. The principles of multiple-allele selection then apply, and some of the "alleles" may be eliminated by selection. The presence of recombination complicates matters because each gametic type is continually recreated by recombination even if it is disfavored by selection. The influence of recombination on the outcome of selection is determined by the recombination fraction and by the degree of interaction between the loci. When selection acts on the phenotype produced by the joint effects of multiple loci, there are two general situations:

• Changes in allele frequency are driven primarily by the selection coefficients and recombination plays a minor role.
• Selection and recombination are about equally important in determining the outcome.

The former is usually the case with weak epistasis and moderate or loose linkage; the latter is more prevalent with strong epistasis and tight linkage.

The term epistasis is often used in population genetics as a synonym for gene interaction; it applies to any situation in which the genetic effects of different loci that contribute to a phenotypic trait are not additive. In the two-locus, two-allele example, the fitnesses (survivorships) of the genotypes can be written as shown in Table 6.3, where it is assumed that the two types of double heterozygote (AB/ab and Ab/aB) have the same fitness; for convenience, this value is often set at $w_{22} = 1$. For each single-locus genotype, the average survivorship is equal to the weighted average across each genotype at the other locus. In Table 6.3, these averages are denoted $\bar{w}_{AA}$, $\bar{w}_{Aa}$, and so on. Additivity across loci means that $w_{11} = \bar{w}_{AA} + \bar{w}_{BB}$, $w_{12} = \bar{w}_{AA} + \bar{w}_{Bb}$, and so forth for all genotypes, including $w_{22} = \bar{w}_{Aa} + \bar{w}_{Bb} = 1$. If additivity does not apply across all nine genotypes, then epistasis is said to be present. A discussion of epistasis from a statistical point of view appears in Chapter 9.

TABLE 6.3 TWO-LOCUS FITNESSES (SURVIVORSHIPS)

                         Genotype at B locus
Genotype at A locus      BB          Bb              bb          Average
AA                       w_11        w_12            w_13        w̄_AA
Aa                       w_21        w_22 = 1        w_23        w̄_Aa
aa                       w_31        w_32            w_33        w̄_aa
Average                  w̄_BB        w̄_Bb            w̄_bb

Note: The table assumes that the two types of double heterozygotes, AB/ab and Ab/aB, have the same fitness, w_22.

When there is strong epistasis and tight linkage, complications abound. With two loci and two alleles at each, there are as many as 15 equilibria. Most of them are unstable, but examples are known in which four interior equilibria are simultaneously stable. Figure 6.12 is one example that shows the average fitness in the population $\bar{w}$ as a function of the allele frequencies of A and B. At any point in time, the gametic frequencies in the population are determined not only by the allele frequencies of A and B but also by the linkage disequilibrium parameter, which was denoted by the symbol D in the section on linkage disequilibrium in Chapter 3. In this example, all 15 equilibria are realized. There are four corner equilibria in which one gametic type is fixed, namely, AB, Ab, aB, or ab; there are also four edge equilibria in which one allele of either locus is fixed, namely, A, a, B, or b. With the survivorships as in Figure 6.12, all of the corner and edge equilibria are unstable.
There are also three unstable interior equilibria, each of which has $p_A = p_B = 1/2$ and so is located at the position of the open circle on the saddle in Figure 6.12; these equilibria have the same allele frequencies but differ in the degree of linkage disequilibrium. The positions of the stable equilibria are indicated by the solid circles, each of which represents two equilibrium points with the same allele frequencies but differing in their value of D. In this case, the equilibria are symmetrical.

[Figure 6.12 An example of two-locus, two-allele survivorship selection in which there are four stable interior equilibria, the positions of which are indicated by the dots near two of the corners. Each dot represents two stable points differing in the sign of the linkage disequilibrium (D = -0.074412 and D = +0.074412). This example also includes three unstable interior equilibria (represented by the open circled point in the center), four unstable edge equilibria, and four unstable corner equilibria. The survivorship parameters, in the notation of Table 6.3, are $w_{11} = w_{13} = w_{31} = w_{33} = 0.9$, $w_{21} = w_{23} = 0.8$, and $w_{12} = w_{32} = 0.6$; the recombination fraction is r = 0.09. (Example from Hastings 1985.)]

Figure 6.12 is a plot of $\bar{w}$ to emphasize that the average fitness in the population is not necessarily a maximum at equilibrium. In this example, none of the four stable equilibria is a point of maximum average fitness. The maximum fitness is found at any of the four corners, and these equilibria are unstable. Furthermore, in the vicinity of each stable equilibrium, as the population moves toward the equilibrium from certain directions, the average fitness must decrease as the equilibrium is approached. Hence, not only is average fitness not necessarily a maximum at equilibrium, natural selection can cause a decrease in average fitness.
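To make the role of D and recombination concrete, here is a minimal Python sketch of one generation of two-locus, two-allele survivorship selection. It is an illustration under stated assumptions rather than anything printed in the text: the recursion is the standard textbook form (selection on zygotes, then recombination in double heterozygotes), the function names are hypothetical, and the survivorship table is my reading of the Figure 6.12 caption with $w_{22} = 1$, so those values should be treated as an assumption. The starting frequencies are arbitrary.

```python
def gamete_fitness_matrix(table):
    """Expand a 3x3 genotype table (rows AA, Aa, aa; columns BB, Bb, bb)
    into a 4x4 matrix indexed by the gametes AB, Ab, aB, ab."""
    gametes = [(1, 1), (1, 0), (0, 1), (0, 0)]  # (copies of A, copies of B)
    return [[table[2 - (a1 + a2)][2 - (b1 + b2)] for (a2, b2) in gametes]
            for (a1, b1) in gametes]


def step(x, w, r):
    """One generation of viability selection followed by recombination.

    x: gamete frequencies [AB, Ab, aB, ab]; w: 4x4 gamete-pair fitnesses;
    r: recombination fraction. Returns the new frequencies together with the
    mean fitness and linkage disequilibrium D of the starting generation.
    """
    marg = [sum(w[i][j] * x[j] for j in range(4)) for i in range(4)]
    w_bar = sum(x[i] * marg[i] for i in range(4))
    d = x[0] * x[3] - x[1] * x[2]      # linkage disequilibrium D
    flux = r * w[0][3] * d             # gametes reshuffled in double heterozygotes
    signs = (-1, 1, 1, -1)             # AB and ab lose what Ab and aB gain
    new_x = [(x[i] * marg[i] + signs[i] * flux) / w_bar for i in range(4)]
    return new_x, w_bar, d


# Assumed survivorships (rows AA, Aa, aa; columns BB, Bb, bb) and r = 0.09.
table = [[0.9, 0.6, 0.9],
         [0.8, 1.0, 0.8],
         [0.9, 0.6, 0.9]]
w = gamete_fitness_matrix(table)

x = [0.4, 0.1, 0.1, 0.4]               # arbitrary starting gamete frequencies
for generation in range(201):
    x_next, w_bar, d = step(x, w, 0.09)
    if generation % 50 == 0:
        print(generation, round(w_bar, 4), round(d, 4))
    x = x_next
```

Printing the mean fitness along the trajectory is a simple way to see whether it rises or falls on the way to an equilibrium; trying other starting points and other values of r shows how strongly the outcome depends on D.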
In models in which fitness depends on multiple interacting loci, do we really have to give up the attractive generalization from one-locus theory that selection acts in such a way as to increase average fitness? Not altogether. Although even the two-locus, two-allele model of survivorship selection is beyond present techniques of mathematical analysis, an important generalization has come from approximate solutions as well as from computer simulations (Ewens 1979): if epistasis is not too strong, and linkage is not too tight, then the average fitness in the population usually increases.

This statement is multiply qualified ("not too strong . . . not too tight . . . usually increases") because exceptions can rather easily be constructed. However, the generalization is observed in most generations in most numerical examples when the survivorships are chosen at random (Karlin and Carmelli 1975). To the extent that it is true, the generalization supports the powerful metaphor that natural selection tends to increase average fitness. If one can imagine a complex surface of hills and valleys corresponding to regions of high and low average fitness, then one can speak metaphorically of a population as a sort of "hill climber" moving across this surface and scaling a fitness "peak." This picturesque analogy is a central concept in Wright's shifting balance theory of evolution, which is discussed in the last section of this chapter. However, there are enough exceptions to the hill-climbing generalization that maximization of average fitness cannot be used as a guide to predicting the outcome of any particular set of fitness values. Each model must be considered in detail on its own.

Sexual Selection

It seems that, wherever you look in nature, animals have physical adornments or behavioral displays to help them in obtaining mates.
In some cases, there is direct competition between animals, usually males, as exemplified by the contests of antler bashing in moose or head butting in bighorn sheep. In other cases, there is indirect competition, as seen in the behavioral displays of male peacocks in full plumage strutting their stuff. These are dangerous activities. A bighorn sheep can get his skull fractured or fall off a cliff. The male peacock is conspicuous, burdened, and preoccupied, and so vulnerable to any predator. Darwin (1871) was the first to draw attention to competition for mates as a source of selection not necessarily related to adaptation of the organism to its environment. This type of selection he called sexual selection.

In the case of direct competition for mates, it is easy to understand that a successful male leaves more progeny than an unsuccessful male, and so alleles promoting the physical adornments, strength, and aggressiveness needed for successful competition for mates are perpetuated even though they may occasionally be detrimental. The example of indirect competition is considerably more subtle because the male is merely advertising. The female does the choosing. One theory for the evolution of male sexual displays is that, in the early stages of their evolution, the displays take advantage of a female preference. The origin of the initial preference is unclear. Darwin suggested that female choosiness and offspring number are both associated with superior nutrition, hence choosy females may, at the beginning, have had more offspring. Whatever the cause, given an initial choosiness among females, males with more effective displays are chosen preferentially as mates, and their offspring receive alleles that create both the displays in the males and the preferences in the females.
If these traits are genetically correlated (as, for example, through common hormonal or neurological pathways or through linkage disequilibrium), then selection becomes a self-accelerating process promoting increasingly elaborate displays and increasingly greater choosiness. According to Fisher (1930):

The two characteristics affected by such a process, namely plumage development in the male, and sexual preference for such developments in the female, must thus advance together, and so long as the process is unchecked by severe counterselection, will advance with ever-increasing speed. In the total absence of such checks, it is easy to see that the speed of development will be proportional to the development already attained. There is thus, in any situation in which sexual selection is capable of conferring a great reproductive advantage, the potentiality of a runaway process which, however small the beginnings from which it arose, must, unless checked, produce great effects, and in the later stages with great rapidity.

The ever-accelerating process is called runaway sexual selection, and the conditions under which it takes place have been studied theoretically (Lande and Arnold 1985; Kirkpatrick and Barton 1995; Iwasa and Pomiankowski 1995).

KIN SELECTION

One alternative type of selection, called kin selection, makes use of an extended concept of "fitness." In kin selection, a positive selection for certain alleles takes place indirectly through enhanced reproduction of the genetic relatives of carriers of the alleles rather than directly through an increased fitness of the carriers themselves. Kin selection has been postulated in attempts to account for the evolution of altruism. A behavior is regarded as altruism if it increases the fitness of other organisms at the expense of one's own fitness.
Altruistic behavior is exhibited most dramatically by social insects such as termites, ants, and bees, in which certain worker castes exert their labors for the care, protection, and reproduction of the queen and her offspring but do not reproduce themselves. Other, less dramatic examples of altruistic behavior include phenomena such as the care of offspring by their parents.

A central consideration in kin selection is that relatives have genes in common. Therefore, a gene that causes altruistic behavior can increase in frequency if the increase in the recipient's fitness as a result of altruism is sufficiently large to offset the decrease in the altruist's own fitness. The essentials of the situation can be made clear by considering the case of identical twins. Because identical twins are genetically identical, the reproduction of one's twin is genetically equivalent to reproduction by oneself. Thus, it makes no difference if an altruistic organism decreases its own fitness for the sake of an equal increase in fitness of an identical twin; from an evolutionary point of view, it is an even trade because the combined number of offspring from both twins remains unchanged. By the same token, if an altruistic act decreases the fitness of an organism by an amount less than the increase gained by an identical twin, then the altruism results in a net increase in the combined number of offspring. One would, therefore, expect altruism between identical twins to be favored by natural selection as long as the risk to the altruist is no greater than the benefit to the recipient.

These considerations of identical twins can be extended to other degrees of relationship as well, but the risk to the altruist must be correspondingly smaller than the benefit to the recipient because other types of relatives share fewer genes than identical twins.
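The break-even idea can be put into numbers. The Python sketch below is illustrative only: it assumes the condition c/b < r and the relatedness measure $r = 2F_{XY}/(1 + F_X)$ that the text introduces as Equations 6.24 and 6.25, takes the altruist to be non-inbred unless stated otherwise, and uses the standard values of $F_{XY}$ for each kind of relative. The function names are hypothetical.

```python
from fractions import Fraction


def relatedness(f_xy, f_x=Fraction(0)):
    """r = 2 F_XY / (1 + F_X), with F_X the inbreeding coefficient of the altruist."""
    return 2 * f_xy / (1 + f_x)


def altruism_favored(cost, benefit, r):
    """True when the cost-to-benefit ratio is below the relatedness, c/b < r."""
    return Fraction(cost) / Fraction(benefit) < r


# F_XY: inbreeding coefficient of a hypothetical offspring of X and Y,
# for non-inbred relatives (standard values, assumed here).
F_XY = {
    "identical twin": Fraction(1, 2),
    "full sibling": Fraction(1, 4),
    "nephew or niece": Fraction(1, 8),
    "first cousin": Fraction(1, 16),
}

for relative, f_xy in F_XY.items():
    r = relatedness(f_xy)
    # With cost c = 1, altruism pays only if the benefit b exceeds 1/r.
    print(f"{relative:16s} r = {r}   break-even b/c = {1 / r}")
```

Exact fractions keep the break-even ratios readable: the benefit must exceed the cost by a factor of 1, 2, 4, and 8 for the four relatives listed. Passing a nonzero `f_x` handles an inbred altruist.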
The break-even points for altruism toward various degrees of relationship have been trenchantly summarized by J. B. S. Haldane, who is said to have quipped that he would lay down his life for two brothers, four nephews, or eight cousins. In any case, fitness considerations that take into account not only an organism's own fitness but also the fitness of relatives (other than direct descendants) constitute what is called the inclusive fitness of the organism.

To be concrete, suppose that altruism results in a decrease in fitness c of the altruist that is offset by an increase in fitness b in the recipient. The gene for altruism increases in frequency if the ratio of cost to benefit is small enough, relative to the genetic relationship between the altruist and the recipient; that is, the gene for altruism increases in frequency if

$$\frac{c}{b} < r \qquad (6.24)$$

as shown first by Hamilton (1964) and discussed in detail by Cavalli-Sforza and Feldman (1978) and Uyenoyama and Feldman (1980). In this context, r is a measure of genetic relationship between the altruist X and the recipient of the altruism Y, defined as

$$r = \frac{2F_{XY}}{1 + F_X} \qquad (6.25)$$

where $F_X$ is the inbreeding coefficient of the altruist X, and $F_{XY}$ is the inbreeding coefficient of a hypothetical offspring of X and Y. As illustrated in Figure 6.13, r equals the probability that two gametes from X and Y contain alleles that are identical by descent, $F_{XY}$, relative to the probability that two gametes from X contain alleles that are identical by descent, $(1 + F_X)/2$. The cost-benefit tradeoff in Equation 6.24 is generally valid for weak selection when $F_X = 0$ and valid for additive alleles even when $F_X \neq 0$ (Aoki 1981).

[Figure 6.13 Definition of the genetic relationship between an altruist X and the recipient of the altruism Y. (A) Two alleles chosen at random from an organism X are identical by descent with probability $(1 + F_X)/2$ (see Figure 4.13). (B) Two alleles chosen at random, one from X and the other from Y, are identical by descent with probability $F_{XY}$, which is the inbreeding coefficient of a hypothetical offspring of X and Y. The ratio of $F_{XY}$ to $(1 + F_X)/2$ is the appropriate measure of genetic relationship in the consideration of kin selection.]

PROBLEM 6.11 For the illustrated pedigrees (A) and (B) of full siblings shown in the accompanying figure, calculate the break-even value of the benefit b to the recipient of altruism Y, relative to a cost value c = 1 to the altruist X, in order to ensure an increase in frequency of an additive gene for altruism. Why are the answers different in the two cases?

[Figure: pedigrees (A) and (B)]

ANSWER In case (A), a hypothetical offspring of X and Y has an inbreeding coefficient of $F_{XY} = (1/2)^3 + (1/2)^3 = 1/4$, and $F_X = 0$. Therefore, $r = 2 \times 1/4 = 1/2$ and the break-even value of $c/b = 1/2$. Hence, for c = 1, the break-even value of b = 2. (This calculation is the theoretical basis of Haldane's quip about laying down his life for two brothers.) In pedigree (B), $F_{XY} = 4 \times (1/2)^5 + 2 \times (1/2)^3 = 3/8$, and $F_X = 2 \times (1/2)^3 = 1/4$. Therefore, $r = 2(3/8)/(1 + 1/4) = 3/5$. For a cost of c = 1, the break-even value of b equals 5/3. The values differ in the two cases because of the differing inbreeding. In case (B), even though X is inbred, the break-even value of b is smaller because of the closer genetic relationship between X and Y.

INTERDEME SELECTION AND THE SHIFTING BALANCE THEORY

Another alternative type of selection arises in the context of interdeme selection, which takes place between semi-isolated subpopulations (demes) of the same species.
If subpopulations composed of certain genotypes are more likely to become extinct and have their vacated habitats recolonized by migrants from other subpopulations composed of other genotypes, then the more successful subpopulations can, in some sense, be considered as having a greater "fitness" than the less successful ones. Since this concept of population fitness is a characteristic of the entire population and not merely the average fitness of the genotypes within it ($\bar{w}$), interdeme selection is outside the realm of most conventional models of selection. Interdeme selection is one type of group selection (Wilson 1983).

Interdeme selection plays an essential role in the shifting balance theory of evolution of Wright (1977 and earlier). In the shifting balance theory, a large population that is subdivided into a set of small, semi-isolated subpopulations (demes) has the best chance for the subpopulations to explore the full range of the adaptive topography and to find the highest fitness peak on a convoluted adaptive surface. If the subpopulations are sufficiently small, and the migration rate between them is sufficiently small, then the subpopulations are susceptible to random genetic drift of allele frequencies, which allows them to explore their adaptive topography more or less independently. In any subpopulation, random genetic drift can result in a temporary reduction in fitness that would be prevented by selection in a larger population, and so a subpopulation can pass through a "valley" of reduced fitness and possibly end up "climbing" a peak of fitness higher than the original. Any lucky subpopulation that reaches a higher adaptive peak on the fitness surface increases in size and sends out more migrants to nearby subpopulations, and the favorable gene combinations are gradually spread throughout the entire set of subpopulations by means of interdeme selection.

The shifting balance process includes three distinct phases:

1. An exploratory phase, in which random genetic drift plays an important role in allowing small subpopulations to explore their adaptive topography.
2. A phase of mass selection, in which favorable gene combinations created by chance in the random drift phase become rapidly incorporated into the genome of local subpopulations by the action of natural selection.
3. A phase of interdeme selection, in which the more successful demes increase in size and rate of migration; the excess migration shifts the allele frequencies of nearby subpopulations until they also come under the control of the higher fitness peak. The favorable genotypes thereby become spread throughout the entire population in an ever-widening distribution. Where the region of spread from two such centers overlaps, a new and still more favorable genotype may be formed and itself become a center for interdeme selection. In this manner, the whole of the adaptive topography can be explored, and there is a continual shifting of control from one adaptive peak to control by a superior one.

The shifting balance theory has played an important role in evolutionary thinking, in part because of its use of mountain-climbing terms as tropes for stages in the evolutionary progress: "exploration" of the adaptive topography, chance "discovery" of a route to a higher adaptive peak, and ultimately the "conquest" of the highest adaptive peak by the whole species. However, as a comprehensive theory of evolution, many aspects of the theory remain untested. For the theory to work as envisaged, the interactions between alleles must often result in complex adaptive topographies with many peaks and valleys. The population must be split up into smaller subpopulations, which must be small enough for random genetic drift to be important, but large enough for mass selection to fix favorable combinations of alleles.
Although migration between demes is essential, neighboring demes must be sufficiently isolated for genetic differentiation to take place, but sufficiently connected for favorable gene combinations to spread. Because of uncertainty about the applicability of these assumptions, the shifting balance process remains a picturesque metaphor that is still largely untested. However, computer simulations have been carried out to investigate the range of magnitudes of the key parameters that are necessary for the shifting balance process to be effective; these parameters include the size of the subpopulations, the rate of migration and range of dispersal of the migrants, the degree of epistasis between genes, and the rate of recombination (Bergman et al. 1995). Some empirical studies have also explored the partitioning of genetic variance within and between groups for traits associated with fitness (Wade and Goodnight 1991).

One important implication of interdeme selection is that alleles that are harmful in themselves may nevertheless be favored because they are beneficial to the group. This principle is illustrated in the model in Table 6.4, where the allele A' is harmful to organisms within demes but favorable to the deme as a whole. Equation 6.11 implies that, within the ith deme, $\Delta q_i = -cq_i(1 - q_i)$ (assuming that $\bar{w} = 1$). Averaging across all of the subpopulations, the change in allele frequency resulting from selection within subpopulations, $\Delta q_w$, equals $-c\bar{q}(1 - \bar{q})(1 - F)$, where F is the fixation index $F_{ST}$ discussed in Chapter 4.

TABLE 6.4 MODEL OF INTERDEME SELECTION

Genotype                                AA      AA'       A'A'
Frequency in deme i                     [entries missing]
Within-population fitness               1       1 - c     1 - 2c
Between-population fitness of deme i    1 + 2(b - c)q_i

At the same time that within-subpopulation selection takes place, interdeme selection favors demes containing A', and the change in allele frequency resulting from between-subpopulation selection, $\Delta q_b$, equals $2(b - c)\bar{q}(1 - \bar{q})F$, as shown by Crow and Aoki (1982). Putting the within-subpopulation and between-subpopulation selection together, the total change in the frequency of A' is

$$\Delta\bar{q} = \Delta q_w + \Delta q_b = -c\bar{q}(1 - \bar{q})(1 - F) + 2(b - c)\bar{q}(1 - \bar{q})F \qquad (6.26)$$

The terms on the right-hand side can be interpreted by considering the extremes of F = 0 and F = 1. When F = 0, there is no population substructure, which means that all subpopulations have the same allele frequency $\bar{q}$; in this case, the change in allele frequency is just $-c\bar{q}(1 - \bar{q})$. At the other extreme, when F = 1 each subpopulation is fixed for either A or A', and the proportion fixed for A' equals $\bar{q}$. The between-subpopulation selection is therefore analogous to selection between alleles in a haploid organism in which the fitnesses of A and A' demes are in the ratio 1 : 1 + 2(b - c). In this case, therefore, the change in allele frequency is $2(b - c)\bar{q}(1 - \bar{q})$ (from Equation 6.6, assuming that $\bar{w} = 1$). Equation 6.26 implies that $\Delta\bar{q} > 0$ if

$$\frac{b - c}{c} > \frac{1 - F}{2F} \qquad (6.27)$$

This is the condition necessary for selection between demes to override selection within demes, and the formulation is quite general (Crow and Aoki 1982).

A biological interpretation of the inequality in Equation 6.27 can be inferred by comparison with the break-even point for kin selection given in Equations 6.24 and 6.25. Expressing 6.27 in terms of $r = 2F/(1 + F)$, which means that $F = r/(2 - r)$, yields $c/b < r$; this condition is identical to Equation 6.24. In these models, the equivalence between kin selection and interdeme selection results from the shared remote ancestry of the members of each subpopulation caused by random genetic drift among the subpopulations.
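The equivalence just described is easy to check numerically. The Python sketch below is a hedged illustration, not code from the text: it evaluates the right-hand side of Equation 6.26, the threshold in Equation 6.27, and the kin-selection condition c/b < r with r = 2F/(1 + F), and reports whether the three criteria agree. The function names and the parameter values (c = 0.01, mean frequency 0.2, and the grids of F and b) are arbitrary assumptions chosen only for demonstration.

```python
def delta_q(q, b, c, F):
    """Equation 6.26: total change in the mean frequency q of A'."""
    within = -c * q * (1 - q) * (1 - F)        # selection within demes
    between = 2 * (b - c) * q * (1 - q) * F    # selection between demes
    return within + between


def group_benefit_threshold(F):
    """Right-hand side of Equation 6.27, (1 - F) / (2F)."""
    return (1 - F) / (2 * F)


def relatedness_from_F(F):
    """r = 2F / (1 + F), the relatedness implied by population structure."""
    return 2 * F / (1 + F)


c, q = 0.01, 0.2    # arbitrary illustrative values
for F in (0.05, 0.2, 0.5):
    r = relatedness_from_F(F)
    for b in (0.02, 0.05, 0.5):
        increases = delta_q(q, b, c, F) > 0
        by_6_27 = (b - c) / c > group_benefit_threshold(F)
        by_kin_rule = c / b < r
        print(f"F={F:<5} b={b:<5} increases={increases} "
              f"eq_6_27={by_6_27} c/b<r={by_kin_rule}")
```

For every combination in the grid the three columns agree, which is the algebraic equivalence in numerical form: 2(b - c)F > c(1 - F) rearranges to c/b < 2F/(1 + F).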
The members of each subpopulation are related by kinship, and so interdeme selection is the same phenomenon as kin selection; the break-even point is that at which the benefit b to one's kin through interdeme selection equals the cost to one's self c through direct selection against the A' allele.

If there are a large number of subpopulations, each of size N, that exchange migrants in such a way that m is the proportion of genes in each deme that are exchanged each generation for genes chosen at random from the other demes, then the approximate value of F at equilibrium is given by Equation 5.17 as $F = 1/(1 + 4Nm)$. Consequently, the right-hand side of Equation 6.27 becomes 2Nm. In other words, $(1 - F)/2F$ equals the number of migrant diploid organisms per generation. We therefore conclude from Equation 6.27 that selection between demes overrides selection within demes only when the benefit to the group (b - c), relative to the cost to the individual organism (c), is greater than the average number of migrant organisms per generation. This principle defines a rather stringent limit above which migration among demes cancels any possible effects of interdeme selection.

SUMMARY

Natural selection can take place in many different ways. The simplest case is that in a haploid organism in which the relative fitnesses of the alternative genotypes are constant. Models of discrete generations and of continuous exponential growth are presented. In the discrete model, the relative fitnesses are called Darwinian fitnesses; in the continuous model, they are called Malthusian fitnesses. The relationship is that $\ln w = m$, where w and m are the Darwinian and Malthusian fitnesses, respectively.

In a diploid organism, continuous population growth is difficult to model when the genotypes differ in their rates of reproduction.
The "standard" diploid model is that of discrete generations in which the genotypes may differ in the probability of survival from fertilization to adulthood (survivorship or viability selection) but are equal in fertility. In such a model with two alleles, A and a, and constant fitnesses of the diploid genotypes, four outcomes of selection are possible: A becomes fixed; a becomes fixed; there is a globally stable equilibrium; or there is an unstable equilibrium. Fixation of A or a results from directional selection in which either AA or aa is favored and the fitness of the heterozygous genotype is intermediate between the homozygous genotypes (or possibly equal to one of them). The stable equilibrium results from heterozygote superiority (overdominance), in which the fitness of the heterozygous genotype exceeds that of both homozygous genotypes. At the stable equilibrium of allele frequency, the average fitness in the population $\bar{w}$ is maximized. An unstable equilibrium arises when the fitness of the heterozygous genotype is smaller than that of both homozygous genotypes. The outcome of selection then depends on the initial conditions; fixation of either A or a takes place according to whether the initial frequency of A is greater than or less than the unstable equilibrium frequency.

Mutation-selection balance refers to the maintenance of a harmful allele in a population at a low equilibrium frequency because, in every generation, the elimination of preexisting harmful alleles by selection is offset by the introduction of new harmful alleles by mutation. For a completely recessive allele in which the relative fitness of the homozygous recessive genotype is 1 - s, the equilibrium frequency of the harmful allele is given by $\hat{q} = \sqrt{u/s}$, where u is the rate of mutation per generation of the wildtype allele to the harmful allele.
For a partially dominant allele, the relative fitnesses of the heterozygous and homozygous genotypes carrying the harmful allele are 1 - hs and 1 - s, where h is the degree of dominance. In this case, the equilibrium allele frequency is given approximately by $\hat{q} = u/hs$. An important implication of these formulas is that a small degree of dominance of a "recessive" allele has a disproportionate effect in decreasing the allele frequency at equilibrium. Another important implication is the Haldane-Muller principle, which states that, at mutation-selection balance, the total genetic load (measured as the product of the genotype frequency times the decrease in fitness of the genotype) is independent of the fitnesses and depends only on the rate of recurrent mutation.

In nature, selection must often be expected to have a more complex mechanism than that of differential survivorship envisaged in the standard model. Among the more complex types of selection are frequency-dependent selection, density-dependent selection, fecundity selection, selection in age-structured populations, selection when there are heterogeneous environments, diversifying selection favoring rare alleles or genotypes, differential selection in the sexes, selection for X-linked genes, gametic selection, meiotic drive (non-Mendelian segregation), multiple-alleles selection, multiple-loci selection, and sexual selection. Multiple loci are a particularly important source of complexity even when the fitness differences are entirely due to survivorship. In particular, with strong epistasis and tight linkage, there may be multiple stable interior equilibria and the equilibria may not coincide with points of maximum average fitness. With weak epistasis and loose linkage, however, the average fitness in the population usually does tend to increase.

Extended concepts of fitness can include the effects of selection acting on groups of relatives or on subpopulations.
Kin selection invokes the concept of inclusive fitness, which embraces not only an organism's own fitness but also the fitness of its relatives (exclusive of direct descendants). Kin selection has been invoked to explain the evolution of many behavioral traits that appear to be detrimental to the individual organism but beneficial to its relatives. The most dramatic examples are found in social insects, in which certain organisms are reproductively sterile and devote their lives to the care and feeding of the queen and the protection of the colony. Generally speaking, alleles for altruistic behavior can increase in frequency if the loss in fitness of the altruist is offset by the increase in inclusive fitness to the beneficiaries of the altruism. More precisely, for additive alleles, the condition for increase in frequency of an allele predisposing to altruism is $c/b < r$, where c and b are the fitness cost to the altruist X and benefit to the relative Y, respectively, and $r = 2F_{XY}/(1 + F_X)$.

Interdeme selection plays an important role in the shifting balance theory of evolution. According to this theory, adaptive topographies are highly complex surfaces with many peaks and valleys. In small, partially isolated subpopulations, random genetic drift promotes the random exploration of the topography. When, by chance, a subpopulation comes under the control of a higher fitness peak, mass selection takes precedence and rapidly multiplies the favored gene combinations. Excess migration from the successful subpopulation shifts the allele frequencies in surrounding subpopulations and, through repetition of the selection process, the favored gene combinations progressively spread in waves throughout the entire population. Influential as metaphor, the shifting balance theory has not yet been adequately evaluated as an accurate description of the principal mechanism of evolutionary change.

PROBLEMS

1.
Suppose that in the ith generation of a haploid population the fitnesses of A and a are $1 : s_i$. Show that $p_n/q_n = (p_0/q_0)(s_0 s_1 s_2 \cdots s_{n-1})^{-1}$. If this is written as $p_n/q_n = (p_0/q_0)s^{-n}$, then how can s be interpreted?

2. If the fitnesses of AA, Aa, aa are 1.0, 0.9, 0.6, and $p_0 = 0.7$, calculate $p_1$, $p_2$, and $p_3$, the allele frequencies after 1, 2, and 3 generations of selection.

3. Calculate the equilibrium allele frequency with overdominance when the fitnesses of AA, Aa, and aa are, respectively: a. 0.300, 1, 0.700. b. 0.930, 1, 0.970. c. 0.993, 1, 0.997.

4. Calculate $\bar{w}$ for $w_{11} = 0.9$, $w_{12} = 1$, $w_{22} = 0.6$, and $p = 0.8$, assuming random mating. Does any other p give a larger $\bar{w}$? Why or why not?

5. If a rare allele that is lethal when homozygous decreases in frequency by 1% each generation (i.e., $q' = 0.99q$), then what is the selection coefficient against heterozygotes? (Hint: Assume that q is small compared to 1.)

6. If selection is not too intense, an additive gene giving fitnesses $1 + s$, $1 + s/2$, and 1 in AA, Aa, and aa will increase in frequency approximately according to $\ln(p_t/q_t) = \ln(p_0/q_0) + (s/2)t$. Calculate the approximate number of generations required to evolve significant insecticide resistance in an insect population when $s = 1/2$ and $p_0 = 10^{-5}$. Significant resistance in the population may be taken as p_t = 10"'. Show that, when $p_t$, $p_0 \ll 1$, $t = (2/s)\ln(p_t/p_0)$.

7. Show that a random mating diploid population with fitnesses 1, $1 - s$, and $(1 - s)^2$ for AA, Aa, and aa gives the same change in the allele frequency p of A as a haploid population with fitnesses 1 and $1 - s$ of A and a.

8. If selection is not too strong, the time required for the allele frequency of a favored dominant allele to change from $p_0$ to $p_t$ is given by $\ln(p_t/q_t) + (1/q_t) = [\ln(p_0/q_0) + (1/q_0)] + st$. Use this equation to derive the analogous equation for a favored recessive.

9. The following equation has equilibria at p = 0, 1/2, and 1. Classify the equilibria as to stability. If there is a stable equilibrium, is it locally or globally stable? $\Delta p = p(1/2 - p)(1 - p)$

10. Show that the allele frequency of a recessive lethal in generation n is given by $q_n = q_0/(1 + nq_0)$. (Hint: It is easiest to derive an expression first for $1/q_n$.) How many generations are required to reduce the allele frequency by half?

11. The mutation rate to a dominant gene for neurofibromatosis is approximately $9 \times 10^{-5}$ and the reproductive fitness of affected individuals is estimated as 1/2. What is the expected equilibrium frequency of affected individuals at birth?

12. What is the equilibrium frequency of a recessive gene arising with a mutation rate of $4 \times 10^{-6}$ and a reproductive fitness in homozygotes of 0.8? What would it be if the gene were partially dominant with h = 0.05?

13. What is the equilibrium frequency of a recessive gene arising with a mutation rate of $10^{-6}$ with a fitness of 0.4 in homozygotes? How much would this be reduced if the homozygotes did not reproduce at all?

14. For a rare allele maintained at an equilibrium frequency of $q = u/h$, where h is the selection coefficient against heterozygotes, show that the proportion of heterozygous zygotes resulting from new mutations is approximately equal to h.

15. A polymorphism is said to be protected if all of the fixation states are unstable equilibria. Suppose the viabilities of males and females are as follows: AA Aa Females Male 0.9 1.0 I Pi 2 8 5 What is the smallest value of v n that ensures a protected polymorphism? (Hint: Some algebra shows that a condition for polymorphism is $w_{12}/w_{11} + v_{12}/v_{11} > 2$ and $w_{12}/w_{22} + v_{12}/v_{22} > 2$.)

16. If allele a is a recessive lethal in zygotes and the relative fitness of A : a gametes is $1 - s : 1$, then what is the equilibrium allele frequency of a?
(Hint: The recursion simplifies greatly for the case of a recessive lethal, and equilibrium is given by $\hat{p} = v_1 w_{12}/[(v_1 + v_2)w_{12} - v_1 w_{11}]$.)

17. In a Drosophila population cage containing a meiotically driven chromosome known as segregation distorter, the equilibrium frequency of the driven chromosome was approximately 0.125 and the segregation ratio in heterozygotes was about k = 0.75 (Hiraizumi, Sandler, and Crow 1960). The meiotic drive chromosome is homozygous lethal in both sexes. The equilibrium between viability and meiotic drive in this case is $\hat{p} = 2(k - 1)w_{12}/(1 - 2w_{12})$. Use this equation to estimate the approximate value of $w_{12}$ consistent with these data.

18. In a multiple allele system in which each heterozygote is superior to the homozygotes for the alleles it contains, why are all alleles not maintained by selection?

19. The viabilities of genotypes A'A', A'A, and AA are 0.5, 1, and 0.7, respectively. If the initial frequency of allele A' is 0.05, what will the frequency be when the population comes to equilibrium? If a mutation occurs, introducing a novel allele A", such that the fitnesses of A"A", A"A', and A"A are all 0.8, determine whether this allele will increase in frequency.

20. Suppose alleles $A_1$, $A_2$, $A_3$, and $A_4$ are additive in their effects, and the homozygote fitnesses are $A_1A_1$: 0.8, $A_2A_2$: 0.6, $A_3A_3$: 0.4, and $A_4A_4$: 0.2. What are the heterozygote fitnesses? If all alleles are equally frequent, what is the mean fitness for this locus?

CHAPTER 7 Random Genetic Drift

Random Genetic Drift Binomial Sampling Coalescence Wright-Fisher Model

For each generation there is an element of chance in the drawing of gametes that will unite to form the next generation.
Chance alone can result in changes in allele frequency, and because the allele frequencies do not change in any predetermined way by this sampling process, the process is known as random genetic drift. In Chapter 5 we looked at some of the basic principles of how random genetic drift affects levels of variation in populations, but the subtlety and importance of drift are such that we will now devote this chapter to the subject.

RANDOM GENETIC DRIFT AND BINOMIAL SAMPLING

Consider a large population at Hardy-Weinberg equilibrium with alleles A and a at equal frequencies p = q = 1/2. In this population, the genotype frequencies are 1/4 AA, 1/2 Aa, and 1/4 aa. Suppose four individuals are drawn at random from this population to start a colony. It is possible, by chance alone, that the sample will consist of 4 AA individuals. (This chance is $(1/4)^4 = 1/256$.) Similarly, it is possible that all four will be aa. Any other possible sample could have been drawn, and it is not difficult to work out the probability for each type of sample. If the colony remains at just four individuals, this same kind of random sampling occurs each generation. At each generation, there is an opportunity for a large change in gene frequency caused purely by this process of sampling. One consequence of drift soon becomes clear: eventually the population will have either all A alleles or all a alleles. Once the population reaches such a "fixation" state, it is stuck. Only new mutations or migrants into the population can reintroduce variation.

Figure 7.1 The gene frequencies and sampling that occur in the Wright-Fisher model. Initially there are N diploid adults with a gene whose frequency is $p_0$. The adults make an infinite number of gametes having the same allele frequency. From this pool, 2N gametes are drawn at random to constitute the N diploid individuals for the next generation.
In the example above we sampled four diploid individuals each generation. For our purposes, this is equivalent to drawing eight gametes at random from a pool of gametes. For example, if eight gametes are drawn from a population with p = 1/2, there are nine possible outcomes, having 0, 1, 2, 3, . . . , 8 copies of the A allele and the remaining copies being the a allele. The probability of each of the nine possibilities is given by the binomial distribution, first introduced in Chapter 1. For the case of fixation, we need to find the probability of drawing eight copies of the A allele. Each draw is considered independent of the other draws, and each has a chance of 1/2 of yielding an A. This means that the probability of drawing eight consecutive A alleles is $(1/2)^8 = 1/256$. It is no coincidence that this is the same as the probability of drawing four AA genotypes as described above.

In sampling gametes from a finite population, the sampling process is depicted in Figure 7.1. In each generation there are N diploid individuals in the population. Regardless of the way fertilization occurs, we can imagine the sampling process to be one of sampling with replacement, such that the diploid individuals contribute to an essentially infinite gamete pool whose allele frequency is the same as the allele frequency in the adults. From this infinite gamete pool, 2N gametes are drawn and unite at random to form the next generation. Under this kind of sampling process, the distribution of frequencies of gametes is expected to be binomial.

PROBLEM 7.1 Suppose there are a thousand round pea seeds and a thousand wrinkled pea seeds in a soup pot. Enumerate all possible samples of four seeds drawn from the pot, and calculate the probability of each.

ANSWER The chance of drawing a round seed is 1/2 (as is the chance of drawing a wrinkled seed).
The chance of drawing four round seeds is roughly $(1/2)^4 = 1/16$, since the fraction of round seeds remains fairly close to 1/2 even after a few are drawn. The chance of drawing all four wrinkled seeds is also 1/16. There are four ways to get three round and one wrinkled seed: RRRW, RRWR, RWRR, and WRRR, and each of these has chance 1/16. Similarly, there are four ways to get three wrinkled and one round seed: WWWR, WWRW, WRWW, and RWWW, and again, each of these four possibilities has probability 1/16. Finally, we could draw two round and two wrinkled seeds, and such a sample could be drawn in any of six possible orders: RRWW, RWRW, RWWR, WRRW, WRWR, and WWRR. Each order has chance 1/16, so the total chance of getting two round and two wrinkled seeds is 6/16 = 3/8. We have exhaustively enumerated all 16 possible samples of four seeds, and since each of the 16 possibilities has chance 1/16, the sum of the probabilities of the events is 1. This is a check that we have considered all possibilities. Note that the binomial distribution (Equation 7.1) makes these calculations much easier.

To take a specific example, a population of nine diploid organisms arises from a sample of just 18 gametes, but the gametes can be thought of as being sampled from an essentially infinite pool of gametes. Because small samples are frequently not representative, an allele frequency in the sample may differ from that in the entire pool of gametes. In fact, if the number of gametes in a sample is represented as 2N (in this example, 2N = 18), the probability that the sample contains exactly i alleles of type A is the binomial probability

$$\Pr(i) = \binom{2N}{i} p^i q^{2N-i} \tag{7.1}$$

where $\binom{2N}{i}$ means $(2N)!/[i!(2N - i)!]$; p and q are, respectively, the allele frequencies of A and a in the entire pool of gametes (p + q = 1); and i takes on any integer value between 0 and 2N.
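Equation 7.1 is simple to evaluate directly. The following is a minimal sketch, with hypothetical function names, that tabulates the binomial sampling distribution for the four-seed example of Problem 7.1 and for the 2N = 18 case; it treats every draw as independent, which the text notes is only approximately true for the soup pot.

```python
from math import comb

def binomial_probability(i, two_n, p):
    """Equation 7.1: chance that a sample of two_n gametes holds exactly i copies of A."""
    q = 1.0 - p
    return comb(two_n, i) * p**i * q**(two_n - i)

def sampling_distribution(two_n, p):
    """Probabilities for i = 0, 1, ..., two_n copies of A in the sample."""
    return [binomial_probability(i, two_n, p) for i in range(two_n + 1)]

# Four seeds drawn when round and wrinkled are equally common (Problem 7.1).
pea = sampling_distribution(4, 0.5)
print(pea)

# Eighteen gametes drawn from a pool with p = 1/2; the probabilities sum to 1.
dist18 = sampling_distribution(18, 0.5)
print(sum(dist18))
```

The first list reproduces the enumeration in the answer to Problem 7.1: 1/16, 4/16, 6/16, 4/16, and 1/16 for zero through four copies.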
The new allele frequency in the population (call it p') is therefore i/2N because, by definition, the allele frequency of A equals the number of A alleles (in this case i) divided by the total (in this case 2N). In the next generation, the sampling process occurs anew, and the new probability of a prescribed number of A alleles occurring in the 2N gametes is given by the binomial probability above, with p now replaced by p' and q by 1 - p'. Thus, the allele frequency may change at random from generation to generation. Computer-generated examples based on random numbers are shown in Figure 7.2. Each line in Figure 7.2A gives the number of A alleles in 20 successive generations of random genetic drift in a population of size N = 9 (so 2N = 18). As you can see, individual populations behave very erratically. In seven populations, the A allele became fixed (that is, p = 1); in five populations, A became lost (that is, p = 0). The other eight populations remained unfixed (A was neither fixed nor lost), but the final allele frequency among the unfixed populations was as likely to be one value as another. Figure 7.2B shows the same kind of simulation, except now 2N = 100. With a larger population size, the rate at which populations go to fixation is evidently slower. The principal conclusion from Figure 7.2 is that allele frequencies behave so erratically in any one population that prediction is virtually impossible.

Although changes in allele frequency due to random genetic drift in any individual population may defy prediction, the average behavior of allele frequencies in a large number of populations can be predicted. Consider a large number of populations all starting at the same time with the same allele frequency and same population size N. Each of these populations is assumed to undergo drift independently of the other populations.
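Simulations of the kind summarized in Figure 7.2 can be sketched in a few lines. The code below is an illustration only, not the procedure used to produce the figure: the starting frequency of 1/2, the seed, and the function names are all assumptions. Each generation draws 2N gametes independently, each being A with probability equal to the current allele frequency.

```python
import random

def wright_fisher_trajectory(two_n, start_copies, generations, rng):
    """Number of A alleles in each generation under repeated binomial sampling."""
    copies = start_copies
    trajectory = [copies]
    for _ in range(generations):
        p = copies / two_n
        # Draw two_n gametes; each is A with probability p.
        copies = sum(1 for _ in range(two_n) if rng.random() < p)
        trajectory.append(copies)
    return trajectory

def summarize(trajectories, two_n):
    """Count populations that ended fixed for A, lost A, or are still segregating."""
    finals = [t[-1] for t in trajectories]
    fixed = sum(1 for c in finals if c == two_n)
    lost = sum(1 for c in finals if c == 0)
    return {"fixed": fixed, "lost": lost, "segregating": len(finals) - fixed - lost}

rng = random.Random(1)  # arbitrary seed so that a run can be repeated
small = [wright_fisher_trajectory(18, 9, 20, rng) for _ in range(20)]
large = [wright_fisher_trajectory(100, 50, 20, rng) for _ in range(20)]
print(summarize(small, 18))
print(summarize(large, 100))
```

The exact counts depend on the seed, but runs with 2N = 18 typically show many more fixed or lost populations after 20 generations than runs with 2N = 100.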
Except for their finite size, the subpopulations are assumed to satisfy all the assumptions of the Hardy-Weinberg model, with the additional stipulations that (1) the number of males and females is equal, and (2) each individual has an equal chance of contributing successful gametes to the next generation. The key point illustrated in Figure 7.3 is that we can describe how these populations change in allele frequency by considering time slices through the graph, and tallying a histogram of the counts of populations having each specified allele frequency. Initially, the populations will all be close to the starting allele frequency. As time passes, the populations "drift" apart, and eventually they are spread over all possible allele frequencies. Finally, as we will see, each population must go to fixation for one allele or the other.

The trick in understanding drift is to learn how to deduce the distributions of allele frequencies plotted in Figure 7.3. We just described what would happen after one generation: the set of populations would have a range of allele frequencies as described by the binomial distribution. The binomial distribution gives us the probability that a population has allele frequency p' after one generation of drift. If we consider 1000 populations all starting at p, the binomial distribution gives us the fraction of those populations with allele frequency p'. What about the following generation? For each population, one can imagine the whole sampling process as starting over again. The population does not remember where it was the previous generation, and so the binomial sampling occurs again. But this time, the allele frequency is p', and this value must be the frequency used in Equation 7.1. Each of the 1000 populations may have a different p' after one generation of drift, so to get the second generation, we need to calculate the binomial distribution 1000 times and add the values up. Fortunately, R. A. Fisher and Sewall Wright figured out an easier way to do this, which is described in the next section.

Figure 7.2 Simulations of the Wright-Fisher model of random genetic drift. Each line represents a population of size (A) 2N = 18 or (B) 2N = 100, sampled as described in the text. Note that the larger population size results in smaller oscillations of allele frequency.

Figure 7.3 The model of random genetic drift can be seen by imagining a large collection of populations undergoing the process of repeated sampling. As the top part of the figure indicates, the populations' allele frequencies change erratically, and tend to drift apart. At time intervals, a snapshot of the populations would produce distributions of allele frequencies whose variance increases over time.

An experiment designed along the same lines as the one given in Figure 7.3 is shown in Figure 7.4. In this study, the history of 19 generations of random genetic drift in 107 subpopulations of Drosophila melanogaster was followed. Each population was initiated with 16 $bw^{75}/bw$ (bw = brown eyes) heterozygotes and maintained at a constant size of 16 individuals by randomly choosing eight males and eight females to produce the next generation. Each histogram in Figure 7.4 gives the number of populations containing 0, 1, 2, . . . , 32 $bw^{75}$ alleles. The pattern of change in allele frequency in Figure 7.4 may at first appear to be complicated, but in reality a simple thing is happening. The initially humped distribution of allele frequency gradually becomes flat as populations fixed for $bw^{75}$ or bw begin to pile up at the boundaries. The piling up occurs because, once an allele has been fixed or lost, it remains fixed or lost since mutation is negligible over such a small number of generations in small populations. After 19 generations, most of the populations are fixed for one allele or the other, and among the unfixed populations, the distribution of allele frequencies is essentially flat.

Figure 7.4 Random genetic drift in 107 actual populations of Drosophila melanogaster. Each of the initial 107 populations consisted of 16 $bw^{75}/bw$ heterozygotes (N = 16; bw = brown eyes). From among the progeny in each generation, eight males and eight females were chosen at random to be the parents of the next generation. The horizontal axis of each curve gives the number of $bw^{75}$ alleles in the population, and the vertical axis gives the corresponding number of populations. (Data from Buri 1956.)

PROBLEM 7.2 Consider a self-pollinating plant population consisting of a single heterozygous (Aa) individual on a small barren island. Suppose the plant reproduces and dies, so that the generations are discrete, and the population can only consist of a single plant. What is the probability that the population is homozygous at this genetic locus by the second generation?

ANSWER The chance that the first generation offspring is AA is 1/4 and the chance that it is aa is also 1/4, so the chance of fixation in one generation is 1/2. If the first generation offspring is Aa, then the probability of fixation in the second generation (given that the population is not fixed in the first generation) is again 1/2. The probability of not fixing in generation 1 and then fixing in generation 2 is 1/2 x 1/2 = 1/4. Add to this the chance of fixing in one generation and we get 3/4 as the probability of fixation by two generations.
Note that the probability of not going to fixation each generation is 1/2, and so the chance of not fixing for two generations is 1/2 x 1/2 = 1/4.

Consider an infinitely long bowling alley with minor imperfections that displace the ball one way and the other. The gutters represent the fixation states of p = 0 and p = 1. Once the ball goes in the gutter, it cannot get out again. The imperfections keep the ball from rolling in a straight line, and eventually it rolls into the gutter. In this analogy, the size of the population corresponds to the width of the bowling alley; a larger population implies a wider alley. The imperfections still deflect the ball but, in proportion to the width of the alley, the ball's zigs and zags are of a smaller magnitude. Consequently, the ball remains out of the gutter for a longer time, analogous to the longer time to fixation for a larger population. But just as certainly, the ball will eventually land in the gutter.

THE WRIGHT-FISHER MODEL OF RANDOM GENETIC DRIFT

Fisher (1930) and Wright (1931) both considered the consequences of the sort of binomial sampling that occurs in small populations when the sampling occurs repeatedly over many generations. This model, known as the Wright-Fisher model, derives the distribution of allele frequencies among populations undergoing random genetic drift. Although neither Fisher nor Wright formulated the problem in terms of matrices, as used here, this approach makes the problem much simpler and gives the same results. If a population has 2N genes, and there are two alleles (A and a) that may be segregating, then the state of the population can be described by the number of A alleles in the population. The possible states are then 0, 1, 2, . . . , 2N. The states 0 and 2N are special in that these are fixations, and once the population gets into these states, it cannot leave.
The states 0 and 2N are called absorbing states. From any other allele frequency, it is possible for the population to drift to a different allele frequency. However, to use an example from Figure 7.4, if 2N = 32, then the chance of drifting in one generation from 30 copies of gene A to 29 copies of gene A is greater than the chance of drifting to two copies. The probability of the population drifting from the state having i copies to j copies of allele A is known as the transition probability. The transition probability for the Wright-Fisher model is obtained directly from the binomial distribution. If a population has i copies of allele A, then the allele frequency is p = i/2N, and the frequency of allele a is q = 1 - i/2N. The probability of going from i copies of A to j copies of A in one generation is:

$$T_{ij} = \binom{2N}{j} p^j q^{2N-j} \tag{7.2}$$

The transition probabilities can be put in a square matrix T, with elements $T_{ij}$ giving the transition probability from state i to state j for i, j = 0, 1, 2, . . . , 2N. The matrix T contains everything that is needed to predict the expected distribution of populations like those in Figure 7.4 over a series of generations. This type of model, expressed in terms of discrete states with fixed probabilities of going from one state to another, is known as a Markov chain, and it has some very elegant mathematical properties. Iterations of the Wright-Fisher model give the expected outcome of a pure drift process (Figure 7.5). We will use the Wright-Fisher model only to show one aspect about fixation probabilities.

PROBLEM 7.3 Consider a population of four diploid individuals. Calculate the probability that a population with four copies of allele A (allele frequency p = 1/2) drifts in one generation to having three copies. What is the probability that the population has four copies of A? Five copies? Now consider a population of the same size, but initially with two copies of A. What is its probability of drifting to one, two, or three copies?
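Transition probabilities such as those asked for here can also be tabulated by machine. The sketch below builds the full matrix T of Equation 7.2 and iterates it on a starting distribution, using 2N = 32 and 19 generations to echo the design of Figure 7.4. It is an idealized illustration with hypothetical function names, and it is not a reanalysis of the experimental data.

```python
from math import comb

def transition_matrix(two_n):
    """T[i][j] = probability of going from i to j copies of A in one generation."""
    matrix = []
    for i in range(two_n + 1):
        p = i / two_n
        q = 1.0 - p
        row = [comb(two_n, j) * p**j * q**(two_n - j) for j in range(two_n + 1)]
        matrix.append(row)
    return matrix

def step(distribution, matrix):
    """Advance the distribution over states by one generation (row vector times T)."""
    size = len(matrix)
    return [sum(distribution[i] * matrix[i][j] for i in range(size))
            for j in range(size)]

two_n = 32
T = transition_matrix(two_n)

# All populations start with 16 copies of A, that is, p = 1/2.
dist = [0.0] * (two_n + 1)
dist[16] = 1.0
for _ in range(19):
    dist = step(dist, T)

print("expected proportion fixed or lost after 19 generations:", dist[0] + dist[two_n])
```

Rows 0 and 2N of the matrix put all of their probability on the state itself, which is the matrix expression of an absorbing state.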
ANSWER Applying Equation 7.2, we get $T_{43} = [8!/(5!3!)](1/2)^8 = 7/32 = 0.219$, $T_{44} = [8!/(4!4!)](1/2)^8 = 70/256 = 0.273$, and $T_{45} = T_{43} = 0.219$. (Note that the binomial distribution is symmetric when p = 1/2, so there is equal probability for samples that are symmetrically divergent from p = 1/2.) In the case when the initial frequency is 2/8, we get $T_{21} = [8!/(1!7!)](1/4)(3/4)^7 = 0.267$, $T_{22} = [8!/(2!6!)](1/4)^2(3/4)^6 = 0.311$, and $T_{23} = [8!/(3!5!)](1/4)^3(3/4)^5 = 0.208$.

The above problem illustrates an important feature of the Wright-Fisher model. The magnitude of change in allele frequency is greater when the allele frequency is 1/2 than it is when the allele frequency is more skewed. The changes are greater because the variance in the binomial sampling distribution is greatest when p = 1/2. (The formula for the binomial variance is pq/2N.) The variance drops to zero at p = 0 and p = 1. The variance formula makes it clear that a large population will change allele frequency more slowly than a smaller population because the sampling variance varies as the reciprocal of population size. Furthermore, drift has no directional tendency: the expected change in allele frequency is zero, regardless of the allele frequency. The process of drift does not recognize when a population is close to a fixation. Increases and decreases in frequency balance on average, whatever the current population allele frequency. Fisher and Wright also addressed the expected time to fixation. Since another approach yields the solution to this problem much more easily, we will consider times to fixation in the next section.

PROBLEM 7.4 Simulating random drift can be a very time-consuming proposition. If one wants to simulate a population of 1000 individuals for 1000 generations, one has to draw $10^6$ random numbers and for each decide whether to accept or reject each genotype.
Kimura (1980b) came up with a shortcut that relates very closely to how the diffusion approximation works (see the next section). The trick is to use the recursion $p' = p + (2U - 1)\sqrt{3pq/2N}$, where U is a random number uniformly distributed between 0 and 1. Each generation, one picks one random number U, and a sample realization of the next generation's allele frequency is gotten from the above recursion. Why does this approach work? (Hint: The variance in a uniform distribution is the square of the range divided by 12.)

ANSWER The expression 2U - 1, where U is a number between 0 and 1, gives a value from -1 to +1, or a range of 2. The range of $(2U - 1)\sqrt{3pq/2N}$ is therefore $2\sqrt{3pq/2N}$. Squaring this expression and dividing by 12, the variance of this uniform random variable is thus pq/2N, just what we get from a binomial sampling distribution. Each generation the allele frequency has an equal chance of increasing or decreasing, and the variance in the allele frequency change is pq/2N. Even though the distribution of change in allele frequency is uniform in the pseudosampling simulation instead of binomial (as it is in the Wright-Fisher model), this process can reproduce most of the results of the complete brute-force simulation at a tiny fraction of the computer time.

THE DIFFUSION APPROXIMATION

The pattern of change in allele frequency shown in Figure 7.4 is very nearly that expected theoretically for an ideal population, and although the full-blown theory of random genetic drift requires mathematics beyond the scope of this book, some background might be of interest (see Kimura 1955, 1964, 1976; Wright 1969; Crow and Kimura 1970; Kimura and Ohta 1971). The representation of random drift by a differential equation was first applied by Fisher (1922), who noted that the equation describing the diffusion of heat through a solid bar applies to random genetic drift.
The distribution of populations with allele frequencies ranging from 0 to 1 is called $\phi(x,t)$, where x represents the allele frequency and t indicates time. Figure 7.5 shows a particular realization of $\phi(x,t)$ changing through time. The theoretical problem is to formulate an equation that describes how $\phi(x,t)$ changes under random genetic drift, and to solve the equation.

The parallel between the physical process of diffusion and what is actually going on in a finite population is a bit abstract, but it is not particularly difficult. Consider an axis of allele frequency extending from 0 to 1. The number of populations whose frequency is between x and $x + \delta x$ at time t is our probability density $\phi(x,t)$. Populations may enter this range of allele frequencies by drifting in from a lower frequency, which occurs with a probability flux $J(x,t)$. Populations may leave this range of allele frequencies by drifting out, which occurs with probability flux $J(x + \delta x,t)$. The rate of change in $\phi(x,t)$ is the difference in these fluxes, which we can write

$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x} J(x,t) \tag{7.3}$$

because $[J(x,t) - J(x + \delta x,t)]/\delta x \approx -\frac{\partial}{\partial x} J(x,t)$ when $\delta x$ is very small. The probability flux is

$$J(x,t) = M(x)\phi(x,t) - \frac{1}{2}\frac{\partial}{\partial x}\left[V(x)\phi(x,t)\right] \tag{7.4}$$

where M(x) is the average change in allele frequency in a population whose current allele frequency is x and V(x) is the variance in change in allele frequency. M(x) is zero unless there is some force, like mutation or selection, driving the allele frequency to change in a particular direction. (Remember that with pure drift, increases and decreases in allele frequency balance on average.) V(x) tells how fast allele frequencies change for a population with frequency x. Under the Wright-Fisher model, $V(x) = x(1 - x)/2N$, which means that the binomial sampling variance describes the magnitude of allele frequency change.

Figure 7.5 Wright-Fisher model results for the distribution $\phi(x,t)$ (see Figure 7.12).
But the probability flux depends on the difference in rates of change from x to $x + \delta x$, and so, just as in classical physical models of diffusion, the flux depends on the gradient in whatever is diffusing. In the case of chemical diffusion, this gradient would be the gradient in concentrations, and a greater difference in concentrations would yield higher flux. In the case of our population genetic model, the gradient is the change in sampling variance as x is varied, or dV(x)/dx. Substituting Equation 7.4 into Equation 7.3, we get

$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x}\left\{M(x)\phi(x,t) - \frac{1}{2}\frac{\partial}{\partial x}\left[V(x)\phi(x,t)\right]\right\} = -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\left[V(x)\phi(x,t)\right] \tag{7.5}$$

Equation 7.5 is known as the diffusion equation, the forward Kolmogorov equation, or, in the context of the physics of heat diffusion, the Fokker-Planck equation. For the Wright-Fisher model, M(x) = 0 and $V(x) = x(1 - x)/2N$, so we get

$$\frac{\partial \phi(x,t)}{\partial t} = \frac{1}{4N}\frac{\partial^2}{\partial x^2}\left[x(1 - x)\phi(x,t)\right] \tag{7.6}$$

Many aspects of this problem were explored by Wright (1931), and the formal solution to this equation, found by Kimura (1955), required some heavy mathematics. For our purposes, some graphs will illustrate the important properties of the diffusion equation. The two families of curves in Figure 7.6 are the theoretical distributions $\phi(x,t)$ of allele frequency among unfixed populations after various times (t) measured in units of N generations. In Figure 7.6A, all populations have an initial allele frequency of 1/2, as in the actual populations in Figure 7.4; after about t = 2N generations, the distribution of allele frequency is essentially flat, and by this time about half the populations are still unfixed. The distributions in Figure 7.6 refer only to those populations that are unfixed; as time goes on, more and more of the populations become fixed, and the distributions progressively pile up at 0 and 1, as in the histograms in Figure 7.4. Indeed, in Figure 7.6, the area under each curve is equal to the proportion of unfixed populations, which becomes progressively smaller.
In particular, the rate at which the height of the distribution decreases once it becomes flat is about 1/2N per generation.

Figure 7.6 Theoretical results of random genetic drift. (A) Initial allele frequency = 1/2. (B) Initial allele frequency = 0.1. The horizontal axis is allele frequency. The curves have been scaled so that the area under each curve is equal to the proportion of populations in which fixation or loss has not yet occurred. The curves are therefore the distributions of allele frequencies among segregating populations. (From Kimura 1955.)

Figure 7.6B shows what happens when the initial allele frequency is 0.1; here the distributions are highly asymmetrical, and the distribution of allele frequency does not become flat until about $t = 4N$ generations, by which time only about 10% of the populations remain unfixed. Once a flat distribution of allele frequency is reached, the distribution remains flat, but random drift continues until fixation or loss has occurred in all populations.

PROBLEM 7.5 Demonstrate from Equation 7.6 that the fixation states, $x = 0$ and $x = 1$, are equilibria of the diffusion process.

ANSWER A condition for equilibrium is that $\phi(x,t)$ remains stationary, so that $\frac{\partial}{\partial t}\phi(x,t) = 0$. Substituting into Equation 7.6, we get

$$0 = \frac{1}{4N}\frac{\partial^2}{\partial x^2}\left[x(1-x)\phi(x,t)\right]$$

If $x = 0$ or $x = 1$, this equality is clearly satisfied, because $x(1 - x) = 0$. It takes a bit more work to show that these are the only equilibria.

To illustrate that the diffusion approximation and the Wright-Fisher model give similar results, Figure 7.7 shows the diffusion approximation for the data in Figure 7.4, with $2N = 32$, $p = 1/2$, and $t$ running from generation 1 through generation 19.
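The exact Wright-Fisher calculation that the diffusion equation approximates can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to draw the book's figures: the function names are hypothetical, and the parameters (2N = 32 gene copies, 16 starting copies, 19 generations) simply mirror the values quoted for Figure 7.7. Each generation, the probability distribution of allele counts is pushed through one round of binomial sampling.

```python
from math import comb


def wright_fisher_step(dist):
    """Advance the distribution of allele counts by one generation.

    dist[i] is the probability that a population carries i copies of the
    allele among its 2N gene copies. (Hypothetical helper for illustration.)
    """
    two_n = len(dist) - 1
    new = [0.0] * (two_n + 1)
    for i, mass in enumerate(dist):
        if mass == 0.0:
            continue
        x = i / two_n  # current allele frequency
        for j in range(two_n + 1):
            # Binomial sampling of 2N gametes from a pool with frequency x.
            new[j] += mass * comb(two_n, j) * x**j * (1 - x) ** (two_n - j)
    return new


def wright_fisher_distribution(two_n, start_count, generations):
    """Distribution of allele counts after the given number of generations."""
    dist = [0.0] * (two_n + 1)
    dist[start_count] = 1.0
    for _ in range(generations):
        dist = wright_fisher_step(dist)
    return dist


# Assumed parameters, chosen to match the values quoted in the text.
dist19 = wright_fisher_distribution(two_n=32, start_count=16, generations=19)
lost, fixed = dist19[0], dist19[-1]
unfixed = 1.0 - lost - fixed
mean_freq = sum(j / 32 * m for j, m in enumerate(dist19))
print(f"lost={lost:.3f} fixed={fixed:.3f} unfixed={unfixed:.3f} mean={mean_freq:.3f}")
```

The interior entries of `dist19` play the role of the curve for unfixed populations, while the two end entries are the probability mass that has piled up at 0 and 1. The mean allele frequency stays at its starting value, as expected under pure drift.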
Figure 7.7 ... solution to the diffusion equation for the particular ... This is the three-dimensional view of Figure 7.6, and represents ... approximation to the exact solution obtained by the Wright-Fisher model in Figure 7.5.

Absorption Time and Time to Fixation

One useful application of the diffusion approximation has been to determine expressions for the expected time for a neutral allele to go to fixation. Assuming that the allele starts at frequency $p$, Kimura and Ohta (1969) showed that the mean time (in generations) until the allele is fixed (ignoring cases where the allele is lost) is

$$t_1(p) = -\frac{4N}{p}\left[(1-p)\ln(1-p)\right] \qquad (7.7)$$

Similarly, they showed that the mean time to loss of the allele is

$$t_0(p) = -\frac{4N}{1-p}\left[p\ln(p)\right] \qquad (7.8)$$

Combining Equations 7.7 and 7.8, the mean persistence time of an allele is

$$T(p) = -4N\left[p\ln(p) + (1-p)\ln(1-p)\right] \qquad (7.9)$$

where $T(p)$ is the average time that an allele remains segregating in a population (that is, until its frequency is either 0 or 1). The average times for fixation, loss, and persistence of a neutral allele are shown graphically in Figure 7.8. An allele is expected to remain in a population for the longest time when its initial frequency is 1/2. When $p_0 = 1/2$, the average time that a population remains unfixed is about 2.77N generations.

Figure 7.8 Average persistence of a neutral allele in an ideal diploid population of size N, plotted against initial allele frequency.

PARALLELISM BETWEEN RANDOM DRIFT AND INBREEDING

Consider a set of four subpopulations, each started with allele frequency $p = 1/2$, and each undergoing random drift independently following binomial sampling (Figure 7.9). Within any particular subpopulation (call it subpopulation number $i$), mating is random because all the assumptions in Table 7.1 hold true. If the allele frequencies of A and a in the $i$th subpopulation are denoted $p_i$ and $q_i$, then the genotype frequencies of AA, Aa, and aa are given by the familiar Hardy-Weinberg principle as $p_i^2$, $2p_iq_i$, and $q_i^2$. Furthermore, picture the situation in Figure 7.9 at a time so advanced that all subpopulations are fixed for one allele or the other. Within the $i$th subpopulation, therefore, either $p_i$ equals 0 or $p_i$ equals 1. The genotype frequencies of AA, Aa, and aa in that subpopulation are either 0, 0, and 1 (if $p_i = 0$), or 1, 0, and 0 (if $p_i = 1$). These genotype frequencies, though extreme, still satisfy the Hardy-Weinberg principle. Thus, within any one subpopulation in Figure 7.9, the frequency of heterozygotes is that expected with random mating.

Figure 7.9 A schematic diagram showing a set of four populations undergoing the process of drift. The three panels show the populations initially, after 1.39N generations, and after fixation. Initially the allele frequency is 1/2 in all four populations, and the average heterozygosity is 1/2. As the populations drift in allele frequency, the average is expected to remain the same (indicated by $\bar{p}$ remaining 1/2) but the average heterozygosity decreases. (Genotype frequencies are given for the intermediate generation.) Finally, all populations go to fixation, half fix one allele and half fix the other, so the average allele frequency is still 1/2, but the heterozygosity is zero.

The situation regarding the total population in Figure 7.9 is very different, however, as there is an overall deficiency of heterozygotes. Suppose that we sample the four subpopulations, but that we are unaware of the existence of the four subpopulations, and instead we think that the sample contains a single randomly mating population. Considering the four populations at the right side of Figure 7.9 (after all variation is lost), if we were to calculate the allele frequency we would obtain $p = 1/2$. We would then naively expect a fraction $2pq = 1/2$ of the genotypes to be heterozygous.
In fact, we would have no heterozygotes at all in our sample! This rather paradoxical result (that there is a deficiency of heterozygotes in the total population even though random mating occurs within each subpopulation) is a consequence of the random genetic drift of allele frequencies among subpopulations due to their finite size. This extreme case when each subpopulation is fixed is easy to understand: a population with allele frequency 1/2 could only be made up of two subpopulations each fixed for A, and two subpopulations each fixed for a. The entire population has no heterozygotes whatsoever, but the average allele frequency is 1/2. The total population has a deficiency of heterozygotes, much as if there were inbreeding. This inbreeding-like effect of population subdivision is known as the Wahlund principle (Chapter 4), and we are now in a position to quantify the manner in which subpopulations diverge in allele frequency under random genetic drift.

TABLE 7.1 ASSUMPTIONS OF MODEL OF RANDOM GENETIC DRIFT
(1) Diploid organism
(2) Sexual reproduction
(3) Nonoverlapping generations
(4) Many independent subpopulations, each of constant size N
(5) Random mating within each subpopulation
(6) No migration between subpopulations
(7) No mutation
(8) No selection

Figure 7.10 Diagram illustrating the reasoning behind the recursion for $F$ in a finite population. When the gametes are drawn to make up the population at generation $t$, there is a chance 1/2N that any pair of alleles will be copies of the same allele in generation $t - 1$. If this happens, the probability of identity is 1. For the allele pairs drawn in generation $t$ from two distinct alleles at generation $t - 1$ (the probability of this is $1 - 1/2N$), the probability of identity is $F_{t-1}$. Adding the probabilities of these two events, we get $F_t = 1/2N + (1 - 1/2N)F_{t-1}$.

In Chapter 4 we measured the extent of inbreeding with the inbreeding coefficient, $F$.
$F$ is the probability of autozygosity, or the probability that an individual carries a pair of alleles that are identical by descent (derived from a common ancestor). Even though random mating occurs within each subpopulation in Figure 7.9, because gametes do combine at random, any two alleles in a subpopulation may be identical by descent due to the limited population size. Thus $F_t$ does not equal zero. The value of $F_t$ can be calculated as in Figure 7.10. This figure shows the 2N alleles in a breeding population of generation $t - 1$. In sampling alleles for generation $t$, the first chosen allele may be any of those present in generation $t - 1$ with equal chance. The probability that the second chosen allele is of the same type as the first is 1/2N, because this is the frequency of each allelic type in the gametic pool; the probability that the second chosen allele is of a different type from the first is accordingly $1 - 1/2N$. In the first case, the probability of identity-by-descent is 1; in the second case it is $F_{t-1}$. Altogether the recursion is

$$F_t = \frac{1}{2N} + \left(1 - \frac{1}{2N}\right)F_{t-1} \qquad (7.10)$$

Multiplying both sides by $-1$ and adding 1 leads to

$$1 - F_t = \left(1 - \frac{1}{2N}\right)(1 - F_{t-1})$$

and

$$1 - F_t = \left(1 - \frac{1}{2N}\right)^t (1 - F_0) \qquad (7.11)$$

or, when $F_0 = 0$,

$$F_t = 1 - \left(1 - \frac{1}{2N}\right)^t \qquad (7.12)$$

Figure 7.11 Increase of $F_t$ in ideal populations as a function of time (in generations, $t$) and effective population size $N$.

Figure 7.11 shows the rapid increase of $F_t$ in small populations.
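As a quick numerical check on the algebra, the recursion of Equation 7.10 can be iterated directly and compared with the closed form of Equation 7.11. The sketch below is illustrative only; the function names and the example population sizes are assumptions, not values taken from Figure 7.11.

```python
def inbreeding_by_recursion(n, generations, f0=0.0):
    """Iterate F_t = 1/(2N) + (1 - 1/(2N)) * F_{t-1} (Equation 7.10)."""
    f = f0
    for _ in range(generations):
        f = 1.0 / (2 * n) + (1.0 - 1.0 / (2 * n)) * f
    return f


def inbreeding_closed_form(n, generations, f0=0.0):
    """Closed form 1 - F_t = (1 - 1/(2N))**t * (1 - F_0) (Equation 7.11)."""
    return 1.0 - (1.0 - 1.0 / (2 * n)) ** generations * (1.0 - f0)


# Hypothetical population sizes, chosen only to show the trend.
for n in (10, 100, 1000):
    f_rec = inbreeding_by_recursion(n, generations=50)
    f_cf = inbreeding_closed_form(n, generations=50)
    print(f"N={n:5d}  recursion={f_rec:.4f}  closed form={f_cf:.4f}")
```

The two columns agree to rounding error, and the smaller the population, the faster $F_t$ climbs toward 1.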
Another aspect of the same phenomenon can be appreciated by the probability of drawing a pair of alleles that are not identical by descent. This probability is the same as the heterozygosity, and it can be written

$$H_t = 1 - F_t \qquad (7.13)$$

By substitution for $F_t$ we obtain the rate of change in heterozygosity from random genetic drift

$$H_t = \left(1 - \frac{1}{2N}\right)H_{t-1} \qquad (7.14)$$

and so

$$H_t = \left(1 - \frac{1}{2N}\right)^t H_0 \approx H_0 e^{-t/2N} \qquad (7.15)$$

Recall again that a single population undergoing random drift remains in approximate Hardy-Weinberg proportions, and that the symbol $H_t$ represents a sort of "virtual heterozygosity" averaged across many subpopulations. The above equations show that pure random drift should result in the heterozygosity decreasing at a geometric rate, since $H_t$ is multiplied by the constant $(1 - 1/2N)$ each generation. Experimental tests of this prediction are shown in Figure 7.12. Figure 7.12A shows how the heterozygosity averaged across the populations in Figure 7.4 declines over generations, but the theoretical curve when $N = 16$ does not fit the data very well. In fact, the rate of decline of heterozygosity is greater than the theoretical expectation, as though the population size were smaller than $N = 16$. On the other hand, the allele frequency, averaged across populations, is not expected to change, and the data agree with this aspect of the theory quite well (Figure 7.12B).

PROBLEM 7.6 Use Equation 7.15 to determine how long it takes for a finite population to halve in heterozygosity.

ANSWER Set $\frac{1}{2}H_0 = H_0 e^{-t/2N}$. Dividing out the $H_0$, and taking logarithms, we obtain

$$\ln\left(\frac{1}{2}\right) = -\frac{t}{2N}$$

so $t = 1.39N$. In other words, it takes 1.39N generations to halve the heterozygosity, regardless of its initial value. Fisher expressed this result by saying that it takes 1.39N generations to halve the genic variance in the population.
Since the variance of a binomial sample is $pq/2N$, and the variance in allele frequency among subpopulations is proportional to the heterozygosity, it follows that both the variance and the heterozygosity decrease at the same rate.

Figure 7.12 Theoretical curves for average heterozygosity (A) with $N = 9$ or $N = 16$, along with actual values (plotted as points) from the experiment in Figure 7.4. In (B) the observed and expected allele frequencies (averaged across the 107 subpopulations) are plotted. The horizontal axis in both panels is generation ($t$). (Data from Buri 1956.)

Several important consequences of the population structure in Figure 7.9 can now be summarized. First, although each subpopulation is finite in size, we can imagine so many of them that the size of the total population is effectively infinite. For an infinite population that obeys the assumptions in Table 7.1, the allele frequencies must remain constant. That is, even though the allele frequency in any individual subpopulation may change willy-nilly due to random genetic drift, the overall average allele frequency of A among subpopulations remains $p_0$, where $p_0$ represents the allele frequency of A in the base population. Figure 7.12B gives an experimental demonstration of the constancy of average allele frequency. Since $F_t$ is the probability of autozygosity of a gene in an individual in generation $t$, the probability of allozygosity (obtaining a pair of alleles that are not identical by descent) is $1 - F_t$. Because $p_0$ is the overall allele frequency of A, the probability that a randomly chosen individual will be genotypically AA is $p_0^2(1 - F_t)$ [for the case of allozygosity] $+\ p_0F_t$ [for the case of autozygosity].
Similarly, the probability that the individual will be Aa equals $2p_0q_0(1 - F_t)$; and the probability that the individual will be aa equals $q_0^2(1 - F_t) + q_0F_t$. Note that the genotypic frequencies in the total population are different from the standard Hardy-Weinberg proportions, because there is an apparent excess of homozygotes. However, within any one subpopulation, the genotypic frequencies still obey the Hardy-Weinberg principle because of random mating. Substituting for $F_t$ in Equation 7.12 implies that the average heterozygosity among subpopulations at time $t$ equals $2p_0q_0(1 - F_t) = 2p_0q_0(1 - 1/2N)^t$; this is the theoretical curve plotted in Figure 7.12A (with $p_0 = q_0 = 1/2$).

Since $F_t$ eventually goes to 1, all subpopulations eventually become fixed for one allele or the other. Because the average allele frequency of A remains $p_0$ even when all subpopulations have become fixed, the proportion of subpopulations that eventually become fixed for A must be $p_0$ (and the proportion that eventually become fixed for a must be $q_0$). Stated another way, the probability of ultimate fixation of an allele in any ideal subpopulation is equal to the frequency of that allele in the initial population. This point is illustrated by the actual example in Figure 7.4, where $p_0 = 1/2$; by generation 19, a total of 58 populations have become fixed, 30 for the $bw$ allele and 28 for $bw^{75}$.

EFFECTIVE POPULATION SIZE

As we saw in the Drosophila experiments in Figure 7.12, populations generally fluctuate in allele frequency by an amount greater than $pq/2N$. The reason is that no real population obeys all the assumptions in Table 7.1 exactly.
In any actual case, there must be corrections for such complications as fluctuations in population size, unequal numbers of males and females, age structure, and skewed distributions in family size (see Crow and Kimura 1970). The degree to which genetic drift can change allele frequencies, and the rates of allele fixation by drift, can be approximated under these complicating circumstances by calculating the effective size of the population and using this value in the theory for an ideal population. That is, the effective population size of an actual population is the number of individuals in a theoretically ideal population having the same magnitude of random genetic drift as the actual population. There are three kinds of effective population size based on how we choose to measure "magnitude," namely: (1) the change in average inbreeding coefficient, (2) the change in variance in allele frequency, or (3) the rate of loss of heterozygosity. These are called the inbreeding effective size, the variance effective size, and the eigenvalue effective size, respectively.

Wright (1931) first worked out the effective population size by considering the effective degree of inbreeding in various situations. As noted, the effective population size can also be calculated by determining the rate of change in variance in a population, and Kimura and Crow (1963) first applied this approach to the problem of overlapping generations. Usually, the inbreeding effective size and the variance effective size are the same, but exceptions do occur. Similarly, the variance effective size and the eigenvalue effective size can be distinct (Ewens 1982). Some of the various factors that require calculation of an effective population size will now be illustrated. We will focus on the inbreeding effective size because this concept is the most widely used.
Fluctuation in Population Size

Correction for fluctuating population size is important because natural populations actually do change in size, sometimes by a factor of 10 or more in a single generation. For the sake of simplicity, assume that the population is ideal in all respects except that its size is not constant. We will consider the situation over just two generations. Suppose that the population sizes in two successive generations are $N_0$ and $N_1$. The arguments laid out in Figure 7.10 imply that

$$1 - F_2 = \left(1 - \frac{1}{2N_1}\right)(1 - F_1) \qquad (7.16)$$

and

$$1 - F_1 = \left(1 - \frac{1}{2N_0}\right)(1 - F_0) \qquad (7.17)$$

Substituting from the second equation into the first leads to

$$1 - F_2 = \left(1 - \frac{1}{2N_1}\right)\left(1 - \frac{1}{2N_0}\right)(1 - F_0) \qquad (7.18)$$

By analogy with the constant $N$ case, it is appropriate to try to express this equation in the general form

$$1 - F_t = \left(1 - \frac{1}{2N_e}\right)^t (1 - F_0) \qquad (7.19)$$

where $N_e$ is now the effective population size. In our example $t = 2$, so

$$1 - F_2 = \left(1 - \frac{1}{2N_e}\right)^2 (1 - F_0) \qquad (7.20)$$

Setting the two expressions for $1 - F_2$ equal to each other we obtain

$$\left(1 - \frac{1}{2N_e}\right)^2 = \left(1 - \frac{1}{2N_0}\right)\left(1 - \frac{1}{2N_1}\right) \qquad (7.21)$$

from which $1/N_e = \frac{1}{2}(1/N_0 + 1/N_1)$ turns out to be an excellent approximation. In general,

$$\frac{1}{N_e} = \frac{1}{t}\left(\frac{1}{N_0} + \frac{1}{N_1} + \cdots + \frac{1}{N_{t-1}}\right) \qquad (7.22)$$

and so the effective size $N_e$ is the harmonic mean of the actual numbers, that is, the reciprocal of the average of reciprocals. As illustrated in the problem below, the harmonic mean tends to be dominated by the smallest terms. In biological reality, this means that a single period of small population size, called a bottleneck, can result in a serious loss in heterozygosity. Population bottlenecks are thought to account for the very low levels of polymorphism found in extant populations of the elephant seal (Bonnell and Selander 1974) and the cheetah (O'Brien et al. 1985, 1987). A severe population bottleneck often occurs in nature when a small group of emigrants from an established subpopulation founds a new subpopulation; the accompanying random genetic drift is then known as a founder effect (see Holgate 1966; Nei et al. 1975; Chakraborty and Nei 1977; Neel and Thompson 1978).
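The harmonic-mean rule of Equation 7.22, and the claim that it closely approximates the exact two-generation result of Equation 7.21, can be checked numerically. The sketch below is illustrative only: the function names are hypothetical and the population sizes are arbitrary example values, with one generation deliberately made small to show how a bottleneck dominates.

```python
from math import sqrt


def harmonic_effective_size(sizes):
    """Effective size as the harmonic mean of per-generation sizes (Equation 7.22)."""
    return len(sizes) / sum(1.0 / n for n in sizes)


def exact_two_generation_size(n0, n1):
    """Solve (1 - 1/(2Ne))**2 = (1 - 1/(2N0)) * (1 - 1/(2N1)) for Ne (Equation 7.21)."""
    root = sqrt((1.0 - 1.0 / (2 * n0)) * (1.0 - 1.0 / (2 * n1)))
    return 1.0 / (2.0 * (1.0 - root))


# Arbitrary example: a large population followed by a one-generation bottleneck.
approx = harmonic_effective_size([1000, 10])
exact = exact_two_generation_size(1000, 10)
print(f"harmonic mean: {approx:.2f}   exact from Equation 7.21: {exact:.2f}")

# Three generations with a bottleneck in the middle.
print(f"three generations: {harmonic_effective_size([1000, 10, 1000]):.1f}")
```

In both cases the effective size sits far closer to the bottleneck size than to the arithmetic mean, which is the sense in which the smallest terms dominate.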
Founder effects in human populations have implications in medical genetics, because human populations derived from small numbers of founders may have an elevated incidence of an otherwise rare genetic disorder. Examples include Tay-Sachs disease in Ashkenazi Jews, diastrophic dysplasia in Finns, familial hyperchylomicronemia in Quebecois, and congenital total color blindness in Pingelap Islanders. In addition to reducing the effective size, and thereby increasing $F$, population bottlenecks and founder effects may affect many other aspects of the genetic variation, including causing a reduced number of alleles, a distorted distribution of numbers of molecular site differences among alleles, and an increased level of linkage disequilibrium.

PROBLEM 7.7 Suppose a population went through a bottleneck as follows: $N_0 = 1000$, $N_1 = 10$, and $N_2 = 1000$. Calculate the effective size of this population across all three generations.

ANSWER Using Equation 7.22, we get $1/N_e = (1/3)(1/1000 + 1/10 + 1/1000) = 0.034$, or $N_e = 1/0.034 = 29.4$. The average effective number over the three-generation period is only 29.4, whereas the arithmetic average number of individuals is $(1/3)(1000 + 10 + 1000) = 670$.

Unequal Sex Ratio, Sex Chromosomes, Organelle Genes

A second important case in which the effective size of a nonideal population can readily be calculated concerns sexual populations in which the number of males and females is unequal. This inequality creates a peculiar sort of "bottleneck"; because half of the alleles in any generation must come from each sex, any departure of the sex ratio from equality will enhance the opportunity for random genetic drift. This situation is important in wildlife management, where, for many game animals (pheasants and deer come immediately to mind), the legal bag limit for males is much larger than for females.
Although some management goals are served by such hunting regulations (for example, the species involved are usually polygamous, so one male can fertilize many females and overall actual population size can be maintained), it must be remembered that the resultant inequality in sex ratio reduces the effective population size. Specifically, if a sexual population consists of $N_m$ males and $N_f$ females, the actual size is

$$N_a = N_m + N_f \qquad (7.23)$$

However, the effective population size is

$$N_e = \frac{4N_mN_f}{N_m + N_f} \qquad (7.24)$$

Figure 7.13 shows the relationship between sex ratio and the reduction in effective population size. To take a realistic example, if hunting is permitted to a level at which the number of surviving males is one-tenth the number of females, then the effective population size is a mere one-third of the actual number of individuals in the population.

Figure 7.13 Effective size falls off rapidly in populations with a skewed sex ratio. The horizontal axis is percent female.

A related problem is the effective population size for an X-linked gene. In this case, the variance effective population size is

$$N_e = \frac{9N_mN_f}{4N_m + 2N_f} \qquad (7.25)$$

Equation 7.25 can be justified by noting that the sampling variance for the X chromosomes from males is $p_mq_m/N_m$, whereas the sampling variance for X chromosomes from females is $p_fq_f/2N_f$, in which $p_m$ and $p_f$ are the frequencies of allele A in males and females, respectively. The frequency of an A-bearing X chromosome in the population is

$$p = \frac{1}{3}p_m + \frac{2}{3}p_f \qquad (7.26)$$

and the sampling variance of $p$ is

$$\mathrm{Var}(p) = \frac{1}{9}\frac{p_mq_m}{N_m} + \frac{4}{9}\frac{p_fq_f}{2N_f} \qquad (7.27)$$

At steady state, $p_m = p_f = p$, so $pq$ can be factored out, giving

$$\mathrm{Var}(p) = pq\left(\frac{1}{9N_m} + \frac{4}{9}\cdot\frac{1}{2N_f}\right) = \frac{pq}{2\left[\dfrac{9N_mN_f}{4N_m + 2N_f}\right]} \qquad (7.28)$$

The term in the square brackets corresponds to the $N_e$ in Equation 7.25. It shows why this is a variance effective size: the binomial sampling variance in an ideal population is $pq/2[N_e]$.

PROBLEM 7.8 What is the effective population size for mitochondrial DNA?
(Assume transmission is exclusively from mothers to all offspring.) What is the effective population size for a gene on the Y chromosome, given that the population consists of N diploid individuals and all other assumptions of Table 7.1 apply? (Assume XX individuals are female and XY individuals are male.)

ANSWER Mitochondrial DNA is transmitted essentially exclusively by females. The chance of drawing two mtDNAs that are identical by descent is $1/N_f$, where $N_f$ is the number of females in the population. Hence the effective size is simply $N_f$. Similarly, the effective population size for the Y chromosome is $N_m$, the number of males in the population. Note that even though mtDNA is present in all individuals, while the Y is present only in males, the effective size of mtDNA is not larger. Effective size depends on the sampling properties of a gene, which depend on the gene's transmission, not just on how many individuals carry the gene.

BALANCE BETWEEN MUTATION AND DRIFT

There are many forces in population genetics that act in opposition to one another, and it is this tension that makes for interesting behavior at the population level. Mutation always increases the amount of genetic variation in a population. Random genetic drift results in the loss of genetic variation. Merely because these two forces are in opposition, it does not guarantee that there will be a stable balance between them. In order to formally ask whether the two forces do balance, we need to be careful to specify assumptions about the processes of mutation and drift. We already examined one such model in Chapter 5, the infinite-alleles model, and we saw that in this case the forces do in fact balance to provide an equilibrium level of neutral variation. Let's consider this model once again, in somewhat more detail.
INFINITE ALLELES MODEL

As we saw in Chapter 5, the infinite-alleles model starts with the assumption that each mutation produces a novel allele, never before present in the population. Mutations occur such that each gene in the population has an equal, but low, chance of mutating. Random genetic drift occurs in the manner of the Wright-Fisher model: each generation the population is reconstituted by drawing a sample with replacement from the current sample of alleles. Under these assumptions we saw that the equilibrium probability of identity, $F$, could be approximated as

$$F = \frac{1}{4N\mu + 1} \qquad (7.29)$$

The number of selectively neutral alleles increases under mutation pressure until $F$ satisfies this equation.

PROBLEM 7.9 Derive an expression for $F$ in a finite population with mutation and migration.

ANSWER First assume that there is no mutation, and that new migrant alleles arrive from another population at a rate $m$ per generation. As in the balance between mutation and drift in the infinite-alleles model, we note that alleles can be identical by descent by being drawn twice (with probability 1/2N) or by having two different alleles drawn but having them be IBD from the previous generation. The equilibrium autozygosity can be written

$$F = \left[\frac{1}{2N} + \left(1 - \frac{1}{2N}\right)F\right](1 - m)^2$$

because $(1 - m)^2$ is the probability that neither of the two randomly chosen alleles comes from a migrant. By analogy with the infinite-alleles model, we get (in the case of migration with no mutation)

$$F = \frac{1}{4Nm + 1}$$

When both migration and mutation are occurring, alleles are identical by descent only if they neither mutated nor migrated, and for each allele this occurs with probability approximately $1 - m - \mu$. Thus, the equilibrium autozygosity is

$$F = \frac{1}{4N(m + \mu) + 1}$$

$F$ is the probability of autozygosity, and $H$, which can be thought of as heterozygosity, is also the probability of drawing a pair of alleles that are not autozygous.
Since $H = 1 - F$, under the infinite-alleles model, the equilibrium of $H$ is

$$H = \frac{4N\mu}{1 + 4N\mu} = \frac{\theta}{\theta + 1} \qquad (7.30)$$

where $\theta = 4N\mu$. The relationship between the quantity $4N\mu$ and $H$ was encountered in Chapter 5 and is plotted in Figure 5.7. For a per-locus mutation rate of $10^{-6}$ and a population size of 250,000, we get $\theta = 1$, and so $H = 1/2$. Note that increases in population size have precisely the same effect as increases in mutation rate. Heterozygosity approaches one only if population sizes are very large (such as in microbial organisms) or if mutation rates are very high (such as at some microsatellite loci). Next we will consider how one might go about testing whether a sample from a population exhibits a pattern of genetic variation that is compatible with the infinite-alleles model.

The Ewens Sampling Formula

The infinite-alleles model has an "equilibrium" when $H = 4N\mu/(1 + 4N\mu)$. This is not an equilibrium in the usual sense. In reality, allele frequencies are always changing, new mutations continue to come into the population, and eventually they are eliminated, even perhaps after becoming fixed for some time. The term steady state is probably more appropriate for this kind of behavior, since the alleles are not maintained at a constant frequency, but rather new ones are entering and old ones are leaving the population. The population remains at a steady state in the sense that the number of alleles, and the level of autozygosity, remain stationary. If the number of alleles and the level of autozygosity remain steady, then it is reasonable to assume that there is also a steady-state distribution of allele frequencies. By steady-state frequencies we mean that the most common allele always has a frequency of $p_1$, and the next most common has a frequency of $p_2$, and so on.
The steady-state distribution has the curious property that, even though the most common allele is expected to have a frequency of $p_1$, the identity of the most common allele is expected to change with time. In the steady-state population, not all alleles are equally frequent, and $F$ is greater than it would be were all alleles equally frequent.

Consider the steady-state distribution of allele frequencies from the point of view of an experimenter taking a sample from a population. Let the sample size be $n$ genes, and suppose there are $k$ different alleles in this sample. The sample might consist of, for example, 10 unique alleles, 3 alleles that are represented twice in the sample, 7 alleles that are present 3 times, and so on. Such a description of the sample is called the allelic configuration or partition. A remarkable finding of Ewens was that the expected configuration of a sample drawn from a population obeying the infinite-alleles model is entirely determined by the sample size, $n$, and the number of observed alleles, $k$. Ewens showed that the expected number of alleles in the sample, given $\theta$ and the sample size, is

$$E(k) = 1 + \frac{\theta}{\theta + 1} + \frac{\theta}{\theta + 2} + \cdots + \frac{\theta}{\theta + n - 1} \qquad (7.31)$$

If $\theta$ is very small, $E(k) \approx 1$, whereas for very large $\theta$, $E(k)$ approaches $n$, implying that for a large enough population with a high enough mutation rate, every allele that is sampled will be different. The form of Equation 7.31 suggests that, as the sample size increases, more alleles will be found, but that there is a diminishing return in finding new alleles when the sample size increases. When $E(k)$ is plotted against $\theta$ (Figure 7.14), the increase in the expected number of alleles is greatest for larger sample sizes when the population is highly diverse (large $\theta$).

The infinite-alleles model gives a steady-state prediction of $F$ given $\theta$ (because $F = 1/(1 + \theta)$ from Equation 7.30), and a prediction of $k$ from Equation 7.31.
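Both steady-state predictions are simple enough to evaluate directly. The following sketch computes $E(k)$ from Equation 7.31 and $F = 1/(1 + \theta)$ from Equation 7.30; the function names and the particular values of $\theta$ and $n$ are assumptions chosen for illustration, not values read from the book's figures.

```python
def expected_num_alleles(theta, n):
    """E(k) = sum over i = 0..n-1 of theta / (theta + i) (Equation 7.31).

    The i = 0 term equals 1, matching the leading 1 in the equation.
    Requires theta > 0.
    """
    return sum(theta / (theta + i) for i in range(n))


def expected_identity(theta):
    """Steady-state gene identity F = 1 / (1 + theta) (from Equation 7.30)."""
    return 1.0 / (1.0 + theta)


# Illustrative values only.
for theta in (0.1, 1.0, 10.0):
    row = ", ".join(
        f"n={n}: {expected_num_alleles(theta, n):.2f}" for n in (50, 100, 250)
    )
    print(f"theta={theta:5.1f}  F={expected_identity(theta):.3f}  E(k) -> {row}")
```

For a fixed $\theta$ the expected identity does not depend on $n$, while $E(k)$ keeps growing with sample size, and grows fastest when $\theta$ is large.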
Combining these predictions, the expected relation between $F$ and $k$ is plotted in Figure 7.15. The hyperbolic relation is not surprising, because a population with many alleles will generally have a lower probability of identity of a randomly chosen pair of alleles. For $\theta = 1$, the expected $F$ is 1/2 for all sample sizes, but a larger sample size should yield a greater number of distinct alleles.

Figure 7.14 Relations between $\theta$, the expected number of alleles, and the sample size according to the Ewens-Watterson sampling theory of a population in steady state under the infinite-alleles model of neutral mutation.

Figure 7.15 The infinite-alleles model prediction of the relation between the expected number of alleles, $E(k)$, and the expected gene identity $F$. The three curves represent a range of values of $\theta = 4N\mu$, starting at $\theta = 0.1$ in the upper left, and ending with $\theta = 10$ in the lower right. For the value of $\theta = 1$, the expected $F$, given by the relation $F = 1/(1 + \theta)$, is 1/2, regardless of the sample size. Larger sample sizes always lead to larger expected numbers of alleles, but the difference is greater in more diverse populations (those with smaller $F$).

The Ewens-Watterson Test

The Ewens sampling theory expressed in Equation 7.31 shows that the sample size and the number of distinct alleles observed in the sample are sufficient to give an expected configuration of allele counts. From the observed and expected configurations, a number of test statistics can be devised to determine whether the observed sample fits the expected values of the model. Figure 7.16 shows histograms of the observed and expected allele frequency configurations for alleles in a human population defined by a VNTR polymorphism.
In this particular example, there appears to be a slight excess of the common allele, which is consistent with any number of causes of departure from the infinite-alleles model.

Figure 7.16 Observed (open columns) and expected (black bars) allele frequency distribution of the HRAS-1 locus in humans, identified by Southern blotting with the pLM0.8 probe and TaqI digests. Alleles are ranked from more common to less common. Observed data are from Baird et al. (1986), and the expected distribution was generated using Ewens' sampling theory. In this sample of 490 genes there were 14 distinct alleles, four of which were present in just one individual. (From Clark 1988.)

Keith et al. (1985) isolated 89 homozygous lines from a sample of Drosophila pseudoobscura collected at the Gundlach-Bundschu Winery in Sonoma Valley, California. Homogenized tissue from these 89 lines was then subjected to sequential electrophoresis (a sensitive means of detecting charge and conformation differences among the protein products), and stained to reveal differences in xanthine dehydrogenase (Xdh) mobility. They obtained a common allele that was present in 52 of the lines, one allele that was present in nine lines, one allele that was present in eight lines, two alleles present in four lines each, two alleles that were present in two lines each, and eight singleton or unique alleles.

To test whether the observed configuration fits the expectation, a computer simulation was run to generate realizations of samples from populations that obey the infinite-alleles model, having the same number of alleles and sample size as the observed data. The algorithm to do this simulation is described by F. Stewart in the Appendix to Fuerst et al. (1977), and a listing of a program can be found in Manly (1985).
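The observed side of this comparison is easy to reproduce from the allelic configuration given above. The sketch below is not the simulation algorithm cited in the text; it only tallies the Xdh configuration and computes the gene identity of the sample as the sum of squared allele frequencies. The variable names and the dictionary layout are assumptions made for illustration.

```python
# Allelic configuration of the Xdh sample described above:
# key = number of lines carrying an allele, value = number of such alleles.
xdh_configuration = {52: 1, 9: 1, 8: 1, 4: 2, 2: 2, 1: 8}

# Sample size n (total lines) and number of distinct alleles k.
n = sum(copies * alleles for copies, alleles in xdh_configuration.items())
k = sum(xdh_configuration.values())

# Gene identity of the sample: the sum of squared allele frequencies.
observed_f = sum(
    alleles * (copies / n) ** 2 for copies, alleles in xdh_configuration.items()
)

print(f"n = {n}, k = {k}, observed F = {observed_f:.4f}")
```

The tally recovers the 89 lines reported for the sample, and the single common allele alone contributes $(52/89)^2$, which is most of the observed identity.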
From each computer-generated sample, $F$ is calculated as the sum of the squared allele frequencies. Figure 7.17 shows a histogram of the computer-generated distribution of $F$, along with an arrow showing where the Drosophila sample fell. The sample had an observed $F$ that fell in the upper tail of the distribution, and since so few values of $F$ from the null hypothesis were larger than the observed $F$, Keith et al. rejected the null hypothesis and argued that the data did not fit the infinite-alleles model satisfactorily. The departure was in the direction of excess homozygosity (deficit of $H$), but since the populations were probably in Hardy-Weinberg proportions, a clearer way to state the result would be to say that there was a deficit of genetic diversity for the given number of observed alleles. The deficit means that the common allele is more common than expected, and there are also more singletons than expected. This pattern of frequencies is consistent with purifying selection acting to eliminate the rare, slightly deleterious alleles that continually enter the population by mutation. It is also consistent with an historical effect in which many alleles may have been previously lost and the population has not yet had time to return to equilibrium.

Figure 7.17 Computer-generated distribution of $F$ obtained from 1000 samples from a population obeying the assumptions of the infinite-alleles model with $k = 15$ alleles and a sample of size $n = 89$ (as in the Xdh data from a sample of Drosophila pseudoobscura from the Gundlach-Bundschu Winery studied by Keith et al. 1985). The mean of $F$ from the simulation was 0.168, which is well below the observed $F$ of 0.366 (marked by an arrow at $F = 0.3657$). A significant departure of the observed $F$ from the predictions of the model is noted by the small area under the tail of the distribution to the right of the arrow.

The results of the Ewens-Watterson test can also be reported graphically as in Figure 7.18.
Each gene yields a point specified by the number of distinct alleles and the observed $F$. The two curves represent the 95% confidence interval generated by the Ewens sampling theory. A quick check of the concordance of the data with the model can be made by seeing whether points remain in this confidence region. Although Xdh in Drosophila pseudoobscura provides a dramatic departure from the infinite-alleles model, results like those plotted in Figure 7.18, which show an acceptable fit to neutrality, are more commonly obtained.

Figure 7.18 Gene identity ($F$) plotted against the observed number of alleles ($k$) in a sample of 279 E. coli. The solid lines represent the upper 97.5% and lower 2.5% confidence limits, and the observation that all of the tested loci fall within these limits suggests good concordance with the infinite-alleles model of neutral mutation. (From Whittam et al. 1983.)

INFINITE-SITES MODEL

Rather than considering each mutation as generating a unique allele, with infinitely many possible alleles, we can instead consider an allele as a sequence of nucleotides with mutation altering a site in the sequence. If the mutation rate is sufficiently low, then most sites will be monomorphic, and all polymorphic sites will be segregating for just two nucleotides. Much of the available data on allelic variation in DNA sequence seems consistent with this view: few nucleotide sites are segregating for more than two nucleotides.
If the DNA sequence is sufficiently long and the frequency of polymorphic sites low, then most of the time new mutations will occur at sites that were previously monomorphic. The infinite-sites model, based on these assumptions, was developed by Kimura (1969, 1971), who considered nucleotides as unlinked, and by Watterson (1975), who took account of the nearly complete linkage among sites.

The infinite-sites model is appealing because it directly addresses the type of data that molecular population geneticists can collect. Given an array of DNA sequences of alleles randomly sampled from a population, there is considerable information about the history of the alleles hidden in the patterns of similarity across alleles. The infinite-alleles model ignores this pattern and simply considers the alleles as distinct. A much more powerful treatment is to tabulate the number of sites at which all pairwise combinations of sequences differ, resulting in a so-called mismatch distribution. The infinite-sites model addresses the theoretically expected behavior of the mismatch distribution.

Watterson (1975) considered the distribution of $S_n$, defined as the number of segregating sites in a sample of $n$ genes. For the case of a random sample of two genes, Watterson showed that the steady-state probability that the sequences have $i$ mismatches is

$$\Pr(S_2 = i) = \left(\frac{\theta}{\theta + 1}\right)^{i} \frac{1}{\theta + 1} \tag{7.32}$$

where $\theta = 4N\mu$, and $\mu$ is the mutation rate per gene (not per site). A particular case of this equation gives the probability that two sequences have no sites different, and hence are identical. Substituting $i = 0$ into Equation 7.32, we get

$$\Pr(S_2 = 0) = \frac{1}{\theta + 1} \tag{7.33}$$

in agreement with the infinite-alleles model, because $\Pr(S_2 = 0) = F$, the probability that two alleles drawn at random are identical. The mean and variance in the distribution of number of segregating sites are $\theta$ and $\theta + \theta^2$, respectively.
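As a quick numerical check of the two-sequence result, the sketch below evaluates the geometric distribution $\Pr(S_2 = i) = [\theta/(\theta+1)]^i/(\theta+1)$ and confirms that it sums to one with mean $\theta$ and variance $\theta + \theta^2$. The value $\theta = 2$ and the truncation of the sum at 2000 terms are arbitrary choices made for illustration.

```python
def mismatch_pmf(i, theta):
    """Steady-state probability that two sequences differ at i sites."""
    return (theta / (theta + 1.0)) ** i / (theta + 1.0)


theta = 2.0
# Truncate the infinite sum; terms beyond i = 2000 are negligible here.
probs = [mismatch_pmf(i, theta) for i in range(2000)]

total = sum(probs)
mean = sum(i * p for i, p in enumerate(probs))
variance = sum(i * i * p for i, p in enumerate(probs)) - mean ** 2

print("Pr(identical) =", mismatch_pmf(0, theta))  # equals 1 / (1 + theta)
print("sum, mean, variance:", total, mean, variance)
```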
In reality we do not sample an entire population, so it is important to determine the statistical properties of a smaller sample drawn from a population. Often the sampling properties of population genetic models are very complex and we have to resort to simulations for meaningful estimates. A few results have been obtained for samples drawn from a population obeying the infinite-sites model, and these results are very useful for testing goodness of fit to the model. The expected number of segregating sites in a sample of $n$ alleles is

$$E(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i} \tag{7.34}$$

and the variance in the number of segregating sites is

$$V(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i} + \theta^2 \sum_{i=1}^{n-1} \frac{1}{i^2} \tag{7.35}$$

This expression for the variance is for the case of no intragenic recombination; for $n = 2$ it reduces to $\theta + \theta^2$, as above. It turns out that intragenic recombination does not affect $E(S)$, but it reduces $V(S)$. This is not hard to see intuitively: recombination shuffles the variation among alleles, reducing the variation in the number of sites by which random pairs of alleles differ. The expression for the variance in the number of mismatching sites in the case of free recombination across sites is

$$V = \frac{n + 1}{3(n - 1)}\,\theta + \frac{2(n^2 + n + 3)}{9n(n - 1)}\,\theta^2 \tag{7.36}$$

Figure 7.19 Equilibrium distribution of the number of mismatches between a pair of alleles (horizontal axis: number of segregating sites). Note that if there is free recombination, the variance is smaller compared to the case of no recombination.

Figure 7.19 shows the mismatch distributions for a simulated set of data with free recombination (smaller variance) and with no recombination (larger variance). The relationship between the mean and the variance in the mismatch distribution can be used to make inferences about intragenic recombination (Hudson 1987).
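The sampling results for the number of segregating sites are easy to tabulate. The sketch below assumes Watterson's standard expressions for the no-recombination case, $E(S) = \theta \sum_{i=1}^{n-1} 1/i$ and $V(S) = \theta \sum_{i=1}^{n-1} 1/i + \theta^2 \sum_{i=1}^{n-1} 1/i^2$. The function names and the value $\theta = 5$ are illustrative only.

```python
def harmonic_sums(n):
    """Return (sum of 1/i, sum of 1/i^2) for i = 1 .. n-1."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    return a1, a2


def expected_segregating_sites(n, theta):
    """E(S) for a sample of n alleles under the infinite-sites model."""
    a1, _ = harmonic_sums(n)
    return theta * a1


def variance_segregating_sites(n, theta):
    """V(S) with no intragenic recombination."""
    a1, a2 = harmonic_sums(n)
    return theta * a1 + theta ** 2 * a2


theta = 5.0
for n in (2, 5, 25, 100):
    print(n,
          round(expected_segregating_sites(n, theta), 2),
          round(variance_segregating_sites(n, theta), 2))
```

Because the harmonic sum grows only logarithmically, enlarging the sample adds segregating sites slowly once $n$ is more than a few dozen.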
The assumptions of the infinite-alleles model and the infinite-sites model do not seem to be entirely at odds with one another, and we saw that they predict the same steady-state value for $F$. But the two models do make use of different aspects of the data, and so it would seem that a test of the consistency between the two models might serve as a useful test for the neutral theory. The next problem makes use of just this test, which was devised by Tajima (1989).

PROBLEM 7.10 The average heterozygosity for pairs of randomly chosen alleles under the infinite-alleles model is $E(k) = \theta$, and the expected number of sites segregating in a sample (under the infinite-sites model) is

$$E(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$

Two estimates of $\theta$ are therefore $k$, the average heterozygosity, and $S / \sum_{i=1}^{n-1} (1/i)$. Tajima (1989) devised a test statistic to test the null hypothesis that these two estimates were identical. The test statistic is the difference between these two estimates of $\theta$, or

$$D = k - \frac{S}{\sum_{i=1}^{n-1} (1/i)}$$

If a population were growing rapidly, one might expect this to affect both the number of segregating sites and the heterozygosity. Predict the direction of change of $F$ (probability of identity), $S$ (the number of segregating sites), and $D$ (Tajima's test statistic).

ANSWER First consider a larger population at equilibrium. Since $F = 1/(4N\mu + 1)$, a larger population would have a lower $F$. A larger population would also have a larger number of segregating sites ($S$), and a higher per-site heterozygosity ($k$). At equilibrium, if the gene is neutral, then the Tajima statistic should be zero. In a growing population, $F$ will decrease as added variation accumulates, $S$ will increase, and $k$ will increase. The key point is that the increase in variation will occur in initially rare alleles, which contribute to $S$ but only a little to $k$. Thus, $S$ grows faster than $k$, and $D$ will be negative. If the population stops growing, then Tajima's $D$ statistic will return to zero at equilibrium.
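To make the answer concrete, here is a minimal sketch that computes both estimates of $\theta$ from a small set of aligned sequences. It returns the raw difference described in the problem and, as an addition not spelled out in the text, the same difference divided by its estimated standard deviation using the constants published by Tajima (1989). The toy sequences are invented for illustration: every polymorphism is a singleton, the pattern expected after rapid growth.

```python
from itertools import combinations
from math import sqrt


def tajima_d(seqs):
    """Return (k_hat, S, raw difference, normalized D) for aligned sequences."""
    n = len(seqs)
    length = len(seqs[0])
    # S: number of alignment columns showing more than one nucleotide.
    S = sum(1 for j in range(length) if len({s[j] for s in seqs}) > 1)
    # k_hat: mean number of differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    k_hat = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    d_raw = k_hat - S / a1

    # Normalizing constants from Tajima (1989); not derived in the text.
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    var = e1 * S + e2 * S * (S - 1)
    d_norm = d_raw / sqrt(var) if var > 0 else None
    return k_hat, S, d_raw, d_norm


# Hypothetical sample of five sequences in which each variant is a singleton.
sample = ["AAAA", "TAAA", "ATAA", "AATA", "AAAT"]
k_hat, S, d_raw, d_norm = tajima_d(sample)
print(k_hat, S, d_raw, d_norm)
```

Singletons inflate $S$ more than they inflate the pairwise average, so both versions of the statistic come out negative for this sample, in line with the answer above.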
GENE TREES AND THE COALESCENT

A sample of genes from a population represents more than a snapshot of counts of alleles in a population. Each gene that is sampled has an ancestral history dating back hundreds or thousands of generations. It is possible that a pair of genes sampled today may have come from identical copies of the same allele produced by the same individual just a few generations ago. Or the alleles may have had common ancestry hundreds of generations ago. The term coalescence refers to this process, looking backward in time, and seeing how two genes merge at times of common ancestry. Along this process, one goes from a sample of $k$ genes, to $k - 1$ ancestors after the first coalescence, to $k - 2$ ancestors after the second coalescence, and so forth until there is a single common ancestor for the whole sample. The idea of the coalescent is to consider the ancestral history of genes in a sample by developing a model for the time to common ancestry (Kingman 1980).

To understand how the coalescent process works, consider in Figure 7.20 what happens as time moves forward. In each generation there are a number of alleles in the population, and those alleles may be reproduced and be present in the following generation (moving down the figure), or, in some cases, an allele is not reproduced and is lost from the population. By chance, some alleles may be sampled twice in constituting the next generation, and the probabilities of these events are the same as those under the Wright-Fisher model of random genetic drift. By a repetition of this process over time, eventually one of the original alleles will become "fixed" in the population. In the absence of mutation, the population would therefore be fixed for the same allele; however, because mutation may occur during the process, the alleles observed at the present will not all be identical in nucleotide sequence, even though they all descended from a single common ancestral allele.

Figure 7.20 Diagram showing paths of ancestry of a set of alleles sampled at the present. The population is represented as having a constant size. Starting at the top and working down, notice that many alleles go extinct, and one allele goes to fixation. Considering this process in reverse, the current sample observed at present undergoes a series of coalescence events in which the $k$ alleles present in the current generation had only $k - 1$ ancestors. This process continues backward in time until there is only one ancestral allele.

In reality we do not have the genealogical information enabling us to follow all the alleles through time in a population. Typically what we have is a single "snapshot" represented by a small sample of alleles taken at the present time. Now consider Figure 7.20 again, but this time look at what happens when we go backwards in time. We start with the $k$ alleles in the sample at generation 0. In going from generation 0 to generation 1 (one generation ago), we see that the two rightmost alleles "coalesced" into a single ancestral allele. As we go further back in time, the number of ancestral alleles has to either remain the same or decrease, and each reduction in the number of ancestral alleles is called a coalescence event.

In order to show how this idea can be extended to derive expressions for the entire distribution of branch lengths of a gene tree, we next specify a model. Consider two alleles. The probability that the two alleles came from the same allele in the previous generation is $1/2N$ (in a diploid population of size $N$), so the chance that they came from two distinct alleles the previous generation is $1 - 1/2N$.
The probability that three alleles had three distinct ancestral alleles the previous generation is Pr(alleles 1 and 2 have distinct ancestors) × Pr(allele 3 is different from both 1 and 2) = $(1 - 1/2N)(1 - 2/2N)$. In general, the probability that $k$ alleles had $k$ distinct parental alleles the previous generation is

$$\Pr(k) = \prod_{i=1}^{k-1} \left(1 - \frac{i}{2N}\right) \tag{7.37}$$

Each generation the sampling process occurs independently of what happened before, and so the probability that $k$ alleles had $k$ distinct parental alleles two generations ago is the square of the right-hand side of Equation 7.37.

Consider two alleles again. Suppose we wish to know the chance that the common ancestor of these two alleles occurred exactly $t$ generations ago. In this case there must have been no coalescence (i.e., two distinct ancestral lineages were found) for $t - 1$ generations, and then, in the next preceding generation, a coalescence occurred. The probability of not coalescing for $t - 1$ generations is $(1 - 1/2N)^{t-1}$, and the chance of the two alleles coalescing in any one generation is $1/2N$. The desired probability is the product of these, or

$$\Pr(\text{2 alleles had common ancestor } t \text{ generations ago}) = \frac{1}{2N}\left(1 - \frac{1}{2N}\right)^{t-1} \approx \frac{1}{2N}\, e^{-(t-1)/2N} \tag{7.38}$$

The exponential is an approximation that is quite good when $1/2N$ is small. This distribution has a mean of $2N$ generations and a variance of $(2N)^2 = 4N^2$. Note that the confidence interval around the mean time is not very tight, since the standard deviation of the distribution is equal to the mean.

Returning to our sample of $k$ alleles, the probability that the $k$ alleles do not coalesce for $t$ generations, then one pair coalesces to give $k - 1$ alleles at $t + 1$ generations ago, is as follows:

$$\Pr(k \text{ ancestors for } t \text{ generations},\ k - 1 \text{ ancestors at } t + 1 \text{ generations ago}) = \Pr(k)^{t}\,[1 - \Pr(k)] \approx \frac{\binom{k}{2}}{2N} \exp\!\left(-\frac{\binom{k}{2}}{2N}\, t\right) \tag{7.39}$$

This approximation is valid if $k \ll N$. The distribution in Equation 7.39 has a mean of $4N/[k(k - 1)]$ generations and a variance of $16N^2/[k(k - 1)]^2$.
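The quantities in this derivation can be checked numerically. The sketch below computes the exact probability that $k$ lineages stay distinct for one generation, then compares the exact geometric waiting-time probability $\Pr(k)^t[1 - \Pr(k)]$ with its exponential approximation for a diploid population of size $N$. The function names and parameter values are hypothetical.

```python
from math import comb, exp


def prob_all_distinct(k, N):
    """Probability that k alleles had k distinct parents one generation back."""
    p = 1.0
    for i in range(1, k):
        p *= 1.0 - i / (2.0 * N)
    return p


def prob_first_coalescence_exact(k, N, t):
    """No coalescence for t generations, then one in the next generation."""
    p = prob_all_distinct(k, N)
    return p ** t * (1.0 - p)


def prob_first_coalescence_approx(k, N, t):
    """Exponential approximation, valid when k is much smaller than N."""
    rate = comb(k, 2) / (2.0 * N)
    return rate * exp(-rate * t)


def mean_waiting_time(k, N):
    """Expected generations back to the next coalescence with k lineages."""
    return 4.0 * N / (k * (k - 1))


N = 1000
for k in (2, 5, 10):
    exact = prob_first_coalescence_exact(k, N, 100)
    approx = prob_first_coalescence_approx(k, N, 100)
    print(k, mean_waiting_time(k, N), exact, approx)
```

With $k = 5$ and $N = 1000$ the two probabilities agree to within a fraction of a percent; the approximation degrades as $k$ approaches $N$.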
Figure 7.21 shows what the gene genealogy is expected to be. Starting with five alleles, the first coalescence is expected to occur $2N/10$ generations ago, the next at $2N/6$ generations prior to that, and so on. Note that the time intervals get longer and longer as the number of lineages decreases. The distribution of each of these time intervals is exponential, with ever-increasing means as one goes back in time.

Figure 7.21 The process of coalescence can be represented by a gene tree running from the past to the present, with expected intervals $E(T_5) = 2N/10$, $E(T_4) = 2N/6$, $E(T_3) = 2N/3$, and $E(T_2) = 2N$. At each generation, if there are $k$ alleles present, the expected time back to the next coalescence is $2N/\binom{k}{2}$. Starting with five alleles, the expected time back to the first coalescence is $2N/10$. Note that the successive times get longer. When there are only two alleles, the time back to the final coalescence is $2N$ generations.

The time to the coalescence of all of the $k$ alleles (i.e., the most recent time that one sample of $k$ alleles shared a common ancestor) is

$$\bar{t} = 4N\left(1 - \frac{1}{k}\right) \tag{7.40}$$

with variance

$$V_t = 4N^2 \sum_{i=2}^{k} \frac{1}{\binom{i}{2}^{2}} \tag{7.41}$$

(Kingman 1982; Tajima 1983). As the sample size $k$ increases toward the total population size, $\bar{t}$ approaches $4N$, which equals the expected fixation time for a newly arisen mutation. These principles allow us to generate simulated gene genealogies whose branch lengths correspond to the assumptions of the Wright-Fisher model. One thing the model still lacks is mutation, which is introduced in the next discussion.
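A short simulation illustrates how such genealogies are generated and provides a check on the expected time to the common ancestor. The sketch assumes the exponential approximation: while $i$ lineages remain, the wait to the next coalescence is exponential with mean $4N/[i(i-1)]$ generations. The values $k = 5$, $N = 1000$, and the seed are arbitrary.

```python
import random


def simulate_tmrca(k, N, rng):
    """Total generations back to the common ancestor of k sampled alleles."""
    total = 0.0
    for i in range(k, 1, -1):
        mean_wait = 4.0 * N / (i * (i - 1))       # expected interval T_i
        total += rng.expovariate(1.0 / mean_wait)  # expovariate takes a rate
    return total


rng = random.Random(42)
k, N, reps = 5, 1000, 20000
times = [simulate_tmrca(k, N, rng) for _ in range(reps)]

mean_t = sum(times) / reps
expected = 4.0 * N * (1.0 - 1.0 / k)
print("simulated mean:", round(mean_t, 1), "theory:", expected)
```

Most of the total depth comes from the final interval with two lineages, which is why the simulated times vary so widely from one genealogy to the next.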
Coalescent Models with Mutation

In order to generate simulated gene sequence data representing samples drawn from a population obeying the infinite-sites model, Hudson (1990, 1993) showed that one can proceed as follows:

• Determine the sample size $k$ and the $\theta$ for the gene region of interest;
• Draw random numbers with appropriate exponential distributions to construct a gene genealogy such that times of coalescence follow Equation 7.39;
• On each branch of this tree, distribute mutations with a Poisson distribution, such that the mean number of mutations on each branch is given by $2N\mu t$, where $t$ is the branch length (in units of $2N$ generations).

This procedure has been widely used in generating data sets under the neutral hypothesis for comparison to observed data sets. From Figure 7.21, it follows that the sum of the branch lengths for the entire gene tree is

$$T = \sum_{i=2}^{k} i\,T_i \tag{7.42}$$

The expected number of segregating sites in the whole sample is $2N\mu T$, where $T$ is the sum of the branch lengths. Measuring the intervals in units of $2N$ generations, so that $E(T_i) = 2/[i(i - 1)]$, and substituting, we get

$$E(S) = 2N\mu\,E(T) = 2N\mu \sum_{i=2}^{k} i\,E(T_i) = \theta \sum_{i=1}^{k-1} \frac{1}{i} \tag{7.43}$$

The rightmost expression agrees with Equation 7.34, which we derived for the infinite-sites model.

The coalescent approach can be used to derive many fundamental principles in population genetics. As one example, consider a population presently in mutation-drift equilibrium. In the previous generation, a pair of alleles can either coalesce, with probability $1/2N$, or, failing to coalesce, one or the other allele may mutate with probability $2\mu$. (The factor 2 comes in because either copy can mutate.) These are the only two events that affect identity, and the sum of their probabilities is $1/2N + 2\mu$. The probability of identity is therefore the fraction of the time that the alleles coalesce:

$$F = \frac{1/2N}{1/2N + 2\mu} = \frac{1}{1 + 4N\mu} = \frac{1}{1 + \theta} \tag{7.44}$$

We have already derived this equilibrium identity under the infinite-sites (and infinite-alleles) models. Coalescence methods are not limited to the consideration of the Wright-Fisher model.
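The simulation recipe described above (exponential coalescence times, then Poisson mutations on each branch) can be sketched as follows. This is a simplified illustration rather than Hudson's program: it records only the number of segregating sites, which under the infinite-sites model equals the total number of mutations on the tree, so the tree topology is not built. Times are measured in units of $2N$ generations, so a branch of length $t$ carries a Poisson number of mutations with mean $(\theta/2)t = 2N\mu t$. The helper `poisson` is written out because Python's standard `random` module has no Poisson sampler.

```python
import random


def poisson(mean, rng):
    """Poisson draw: count arrivals of a unit-rate process before time `mean`."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_segregating_sites(k, theta, rng):
    """Number of segregating sites in one simulated sample of k alleles."""
    S = 0
    for i in range(k, 1, -1):
        # Interval with i lineages, in units of 2N generations: mean 2/(i(i-1)).
        t_i = rng.expovariate(i * (i - 1) / 2.0)
        # Each of the i branches spanning this interval collects mutations.
        for _ in range(i):
            S += poisson(0.5 * theta * t_i, rng)
    return S


rng = random.Random(7)
k, theta, reps = 10, 5.0, 5000
draws = [simulate_segregating_sites(k, theta, rng) for _ in range(reps)]

mean_S = sum(draws) / reps
expected_S = theta * sum(1.0 / i for i in range(1, k))
print("simulated mean S:", round(mean_S, 2), "theory:", round(expected_S, 2))
```

Repeating the simulation many times yields a null distribution of $S$ (or of any other summary statistic) against which an observed sample can be compared.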
If one can develop a recursion equation for probabilities of recombination, migration, or other such phenomena in a gene tree context, then often powerful insights can be derived from coalescence approaches. For our purposes, suffice it to say that the method can generate classical results, often with much less difficulty, and the coalescence approach is especially well suited to testing hypotheses about samples drawn from populations.

PROBLEM 7.11 The probability distribution for the number of generations back to the first coalescence (in a pure drift model) in a sample of $k$ genes taken from a haploid population of size $N$ is approximately

$$\Pr(\text{first coalescence } t \text{ generations ago}) = x e^{-xt}, \quad \text{where } x = \binom{k}{2} \Big/ N$$

From this one can show that the mean number of generations back to the first coalescence is $1/x$. The more genes in the sample, the more likely it will be that a coalescence occurred recently. Calculate the expected time to first coalescence in a population of $N = 450$ for a sample of 10 genes. How many genes would you have to sample to halve this coalescence time?

ANSWER The expected time to first coalescence in a population of $N = 450$ for a sample of 10 genes is $N/\binom{k}{2} = 450/\binom{10}{2} = 450/(10 \times 9/2) = 10$ generations. To determine how many genes one would have to sample to halve this coalescence time, solve for $k$ in $5 = 450/\binom{k}{2}$. This is equivalent to $90 = k!/[2!(k - 2)!]$. By trial and error, you will find that a sample of 14 genes will do it, since $\binom{14}{2} = 91$. Note that by increasing the sample only from 10 to 14, we expect to find a pair of alleles half as divergent from each other.

SUMMARY

Gene frequencies fluctuate at random in finite populations. The rate at which allele frequencies change varies inversely with population size. The reason for the inverse relationship is that the sampling variance, when two alleles are segregating in a population, is determined by the binomial sampling process, and the binomial variance is $pq/2N$.
The Wright-Fisher model extended the idea of binomial sampling over multiple generations, and much of our understanding of drift has been derived from this model. In a population in which the only force acting on gene frequencies is random drift, all variation must ultimately be lost. The Wright-Fisher model shows why the probability that an allele will drift to fixation is equal to its initial frequency in the population. The diffusion approximation of the Wright-Fisher model is a second-order partial differential equation that yields the distribution $\phi(x, t)$ giving the number of populations with allele frequency $x$ at time $t$. The diffusion approach has yielded important insights into the consequences of drift, including the expected time to fixation and loss of alleles. The expected time to fixation of a newly introduced allele is $4N$ generations, showing once again that drift happens faster in smaller populations.

A useful way to think about random drift is to consider a set of subpopulations of the same size undergoing repeated generations of sampling and drift. Within each of these subpopulations, genotypes are composed by drawing alleles at random, so that each subpopulation is always in Hardy-Weinberg equilibrium. The hypothetical population composed by pooling the subpopulations will have a deficit of heterozygotes because, as allele frequencies drift closer to fixation, the frequency of heterozygotes declines. Heterozygosity in a finite population is multiplied by $(1 - 1/2N)$ each generation, so that a population of size 10, say, loses 5% of its heterozygosity each generation. The allele frequencies of the subpopulations are equally likely to drift up as down, so the average allele frequency over subpopulations shows no change.

Real biological populations do not precisely fit the Wright-Fisher model. They generally exhibit changes in allele frequency that exceed the amount expected based on the actual population size.
The usual reason for the discrepancy is that the drift process occurs as though there are fewer than the observed census number of individuals. The models give better correspondence to reality by calculating the effective population size. Several different factors that require consideration in calculating effective size were examined in this chapter, including unequal sex ratio, fluctuation in population size over generations, and the uniparental transmission of mtDNA and Y chromosomes.

Mutation introduces variation into populations, and random genetic drift erodes that variation. These two forces come to a steady state predicted by population genetic models. The infinite-alleles model assumes that each new mutation generates a novel allele. The steady-state balance between mutation and drift in the infinite-alleles model is given by the autozygosity, $F$, which is also the probability that two alleles are identical by descent. Under the infinite-alleles model for a diploid population, $F = 1/(1 + \theta)$, where $\theta = 4N\mu$. Note that the mutation rate and population size are confounded in this model; increasing either one will decrease the autozygosity by the same amount. We can write the same equation in terms of heterozygosity $H = 1 - F$, giving $H = \theta/(1 + \theta)$.

The infinite-sites model is related to the infinite-alleles model, but more specifically states that novel mutations occur at a site along the gene that has not mutated before. (If this is true, each new mutation must also generate a novel allele.) The infinite-sites model generates predictions about the number of segregating sites expected in a population at steady state. Here the result is that the expected number of segregating sites is $E(S) = \theta \sum_{i=1}^{n-1} 1/i$, where the mutation rate in this value of $\theta$ is the mutation rate over the entire gene in question.

Classical models of random genetic drift look forward in time, following alleles as they are lost from the population and generated anew by mutation.
More recently, the coalescent approach has been to look backward in time, starting with the observed sample of alleles, and calculating times to common ancestry of alleles. Coalescent approaches are particularly appropriate when one wants to consider the probability that a particular observed set of molecular sequence data might have the characteristics expected of random genetic drift. Computer generation of gene trees using principles of coalescence theory makes it easy to produce a null distribution giving the full range of outcomes expected under a drift model.

PROBLEMS

1. Suppose that in one generation, in a population of size 50, the average heterozygosity (averaged across loci) is reduced from 0.50 to 0.42. Is the population mating at random?

2. In how many generations will the expected heterozygosity be 5% of the initial value in a diploid randomly mating population of size 10? Size 100?

3. A gene in one individual in a population of 24 barn cats undergoes mutation to a new neutral allele. What is the probability that the allele eventually becomes fixed? What is the probability that it eventually becomes lost? What are the answers if the mutant gene is X-linked and the population consists of equal numbers of males and females?

4. If an isolated population of annual alpine plants decreases in heterozygosity by half every 50 years because of random genetic drift, what is its effective population size?

5. Remote Pitcairn Island in the South Pacific was settled in 1789 by Fletcher Christian and eight fellow mutineers from HMS Bounty, along with a small number of Polynesian women. Although many descendants have left the island in the intervening years, there has been essentially no immigration. Assuming an effective size of 20 in each of the eight generations since the island's settlement, what value of $F_{ST}$ would be expected in today's population from random genetic drift?

6. In a population of effective size $N = 50$, how long is required for random genetic drift to double the value of the fixation index $F$ from 0.01 to 0.02? From 0.05 to 0.10? Assuming that $F$ is small, how many generations are required to double the value of $F$ in a population of effective size $N$? For the latter, use the approximations that $[1 - (1/2N)]^t \approx \exp(-t/2N)$ and that, when $F$ is small ($F < 0.10$), $\ln(1 - F) \approx -F$.

7. What is the effective population number in a population of large predatory cats in which each breeding male controls a harem of five females and the total population consists of 200 males and 200 females?

8. What is the effective population size of a herd of ten dairy cows and one bull? What is it for 40 cows and one bull? For 10 cows and two bulls?

9. What is the variance effective population size for an X-linked gene in a population consisting of 100 females and 10 males? In a population of 10 females and 100 males?

10. Among 100 restriction site differences in two inbred strains of the flour beetle Tribolium that are crossed and allowed thereafter to mate at random, what number of restriction sites would be expected to remain segregating after 10 generations, assuming an effective population size of 80 individuals? How many would be expected to remain unfixed after 50 generations?

11. In a haploid population of constant effective size 50, what is the probability that two randomly drawn alleles shared a common ancestor exactly 100 generations ago?

12. Employing the infinite-sites model, if $\theta = 10$, how many segregating sites will one expect to find in a sample of size 10? 20? 50?

13. Consider an isolated island population with no migration, effective population size of 250,000, and a mutation rate of 10"*. Calculate the expected heterozygosity under the infinite-alleles model. How much migration is necessary to increase $H$ to 2 /,?

14. In a haploid population of effective size 50, how large a sample must one take to yield an expected mean coalescence time of 10 generations?

15. Show that random genetic drift requires an average of $t = 2N \ln x$ generations to reduce the heterozygosity from $H$ to $H/x$.

16. Use Equation 7.15 to show that approximately $2N$ generations of random genetic drift are required to reduce the number of segregating genes by a factor of $e$ ($e = 2.71828\ldots$), given initial allele frequencies close to 0.5.

17. A set of six to eight oocytes from each of three women undergoing in vitro fertilization (IVF) were recently tested for heteroplasmy (presence of more than one mitochondrial DNA type within each cell). The mtDNA from eggs of two women were all identical and matched that of somatic cells with no heteroplasmy, but the other woman produced eggs with two different mtDNA types. Densitometry scans allowed investigators to determine that the individual cells had relative frequencies of the two mtDNA types ranging from 20% to 50%. Assuming 30 cell generations from zygote to zygote in the maternal germline, and $N = 1000$ mitochondria per cell, what do you conclude from these observations? Are they consistent with neutral sampling of mtDNA types?

CHAPTER 8

Molecular Population Genetics

Molecular Clock - Synonymous and Nonsynonymous Substitution - Codon Bias - Gene Genealogies - Organelle DNA - Molecular Phylogenetics

ALL THE forces in population genetics have an impact on the pattern of variation seen in molecular sequences of genes, including mutation, migration, selection, and random drift. A primary focus of molecular population genetics is to make inferences about the contribution of each of these evolutionary forces to produce the patterns of molecular sequence variation we see today. Usually this process involves a close interplay between mathematical model building, statistical parameter estimation, and experimental observation.
Several times in the past, unexpected patterns of sequence variation have arisen which, in turn, gave rise to whole new avenues of theoretical inquiry. In many cases, inferences about evolutionary forces transcend species boundaries by making use of data on both within-species polymorphism and between-species divergence. The genetic basis for species isolation is itself amenable to analysis. But first let us begin with the basic theoretical principles that underlie molecular population genetics.

THE NEUTRAL THEORY AND MOLECULAR EVOLUTION

The first systematic application of protein electrophoretic methods to population genetics revealed extensive genetic variation within most natural populations. Typically, 15 to 50% of the genes coding for enzymes were observed to include two or more widespread, polymorphic alleles. The polymorphic alleles occurred with frequencies considered to be too high to result from equilibrium between adverse selection and mutation. Motoo Kimura suggested that most polymorphisms observed at the molecular level are selectively neutral, so that their frequency dynamics in a population are determined by random genetic drift (Kimura 1968). By extension, the hypothesis of selective neutrality would also apply to most nucleotide or amino acid substitutions that occur within a molecule during the course of evolution.

The neutral theory has been of great importance in population genetics in stimulating the collection and analysis of data in attempts to evaluate its adequacy. Mathematical investigations of its implications have resulted in one of the most complete and elegant theories in all of biology. Tests of the correspondence of sample data to the neutral theory are almost universally low in power, which means that large sets of data are needed before one has a reasonable chance of rejecting neutrality.
The recent trend has been that more and more cases of departures from neutrality are being found, in part because of the expansion in available data and in part because of the increasing subtlety of the tests that are applied. Regardless of the action of other forces shaping molecular sequence variation in populations, the force of random drift is always there, and for this reason the neutral theory remains useful in generating rigorous null hypotheses. The next section summarizes some of the theoretical implications of the neutral theory and some of the data bearing on it.

Theoretical Principles of the Neutral Theory

The neutral theory models the fate of mutations that are so nearly selectively neutral in their effects that their fate is determined largely through random genetic drift. A variety of mutation models have been considered, including infinite-alleles, infinite-sites, and finite-sites models. In all models, though, random drift occurs when $N$ adult individuals produce an infinite pool of gametes from which $2N$ are chosen at random to create the $N$ zygotes of the next generation. Much of the complexity of the mathematics of the neutral theory arises from the fact that the mutational histories of alleles are not independent, because they share an overlapping genealogical history. Before we get into the details of the predictions of the neutral theory, let us first review some of the theory's principal implications (Kimura 1983).

1. If a population contains a neutral allele with allele frequency $p_0$, then the probability that the allele eventually becomes fixed equals $p_0$. In particular, a newly arising neutral mutation occurs in just one copy, so the initial allele frequency is $p_0 = 1/(2N)$, and the probability of eventual fixation of the mutation is therefore $1/(2N)$. Figure 8.1 shows that a mutant allele arising in a smaller population has a higher chance of fixation.

2. The steady-state rate at which neutral mutations are fixed in a population equals $\mu$, where $\mu$ is the neutral mutation rate. It is noteworthy that the equilibrium rate of fixation does not involve the population size $N$. The reason is that the $N$ cancels out: the overall rate is the product of the probability of fixation of a new neutral mutation, $1/(2N)$, and the average number of new neutral mutations in each generation, $2N\mu$; hence $(1/2N) \times 2N\mu = \mu$.

3. The average time that elapses between consecutive neutral substitutions equals $1/\mu$. This principle follows directly from the one above. If the steady-state rate of fixation is $\mu$ per unit time, the average length of time between substitutions will be the reciprocal, or $1/\mu$. By way of analogy, if a Swiss clock cuckoos at the rate of 24 times per day, then the average length of time between cuckoos is 1/24th of a day, or one hour. As Figure 8.1 shows, the time interval between fixations is independent of population size, and elevating the mutation rate decreases the time interval between fixations.

Figure 8.1 Diagram showing the trajectory of neutral alleles in a population (x axis: time). Alleles enter the population by mutation and have an initial allele frequency of $1/(2N)$. Most alleles are lost, but those that go to fixation take an average of $4N$ generations. The time between successive fixations of neutral alleles is $1/\mu$ generations. (A) A moderate-size population. (B) The same population size with a higher mutation rate: the time to fixation is the same, but there is less time between fixations. (C) A smaller population has alleles that go to fixation more rapidly, but the time between fixations is still $1/\mu$. (After Kimura 1980.)

PROBLEM 8.1 The neutral theory makes a strong prediction about the relationship between population size and heterozygosity.
Under the infinite-alleles model, we can express the prediction by the formula $H = 4N\mu/(4N\mu + 1)$, and hence small populations should have low heterozygosity and large populations high heterozygosity. Do the data support this prediction? A survey of 77 species reviewed by Nei and Graur (1984) found that species with very small populations (less than, say, $10^4$) had a mean protein heterozygosity of 0.05, whereas species with very large populations (greater than, say, $10^9$, which include Drosophila species) had heterozygosities of around 0.2. This positive correlation seems to favor the neutral theory, except that the range of $H$ is much smaller than theory predicts in view of the enormous range in $N$. When these extremes of population size are excluded ($N < 10^4$ and $N > 10^9$), there is no significant correlation between population size and heterozygosity. What is going on?

ANSWER The paradoxical result demonstrates that levels of variability in a population are determined by several forces, and that different organisms may be affected by the forces to different magnitudes. The result does not support the neutral theory insofar as it shows that population size does not, by itself, explain levels of variation. On the other hand, the discrepancy is not grounds to completely toss out the neutral theory. For one thing, the population sizes were generally only roughly estimated, and effective sizes (Chapter 7), which were not estimated, are more relevant to neutral predictions of heterozygosity. There is also an implicit assumption that mutation rates are identical in all organisms, and violations of this assumption can be found.

4. Analysis of the diffusion equation has shown that, among newly arising neutral alleles that are destined to be fixed, the average time to fixation is $4N_e$ generations (where $N_e$ is the effective population size).
This too is evident in Figure 8.1: alleles that go to fixation do so in less time in the smaller population. Among newly arising neutral alleles destined to be lost, the average time to loss is $(2N_e/N)\ln(2N)$ generations. The average times required for fixation or loss apply to newly arising alleles, which are necessarily present in just one copy, so $p = 1/(2N)$. The implication of these formulas is that, on average, neutral mutations that are going to be fixed require a very long time for this to occur, but mutations destined to be lost are lost quite rapidly.

5. If each neutral mutation creates an allele that is different from all others existing in the population in which it occurs, then, at equilibrium, when the average number of new alleles gained through mutation is exactly offset by the average number lost through random genetic drift, the expected homozygosity equals $1/(4N_e\mu + 1)$, where $\mu$ is the neutral mutation rate. The model of mutation in which each new allele is novel is the infinite-alleles model of mutation. The quantity $4N_e\mu$, which shows up frequently in the neutral theory, is often denoted $\theta$. The equilibrium average homozygosity is therefore $1/(1 + \theta)$. Since the heterozygosity equals one minus the homozygosity, the average heterozygosity at equilibrium in the infinite-alleles model equals $\theta/(1 + \theta)$. Larger populations are expected to have a higher heterozygosity, as reflected in the greater number of alleles segregating at any one time in the larger populations in Figure 8.2.

Figure 8.2 (x axis: logarithm of population size.) Given the enormous variation in effective population sizes, one would expect to see a wider range of variation in heterozygosity than is actually observed. The relation between population size and heterozygosity does not fit the neutral theory expectation over a wide range of intermediate population sizes. (After Nei and Graur 1984.)
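The five implications listed above reduce to a handful of closed-form expressions. The following Python sketch simply evaluates them; the function names and the example values of N and μ are illustrative assumptions and do not come from the text.

```python
import math

def fixation_probability_new_mutant(N):
    """Probability that a single new neutral copy is eventually fixed: 1/(2N)."""
    return 1.0 / (2 * N)

def substitution_rate(N, mu):
    """Steady-state rate of neutral substitution.

    Product of the number of new neutral mutations per generation (2*N*mu)
    and the fixation probability of each (1/(2N)); N cancels, leaving mu.
    """
    return (2 * N * mu) * fixation_probability_new_mutant(N)

def mean_time_to_fixation(Ne):
    """Average generations to fixation for alleles destined to fix: 4*Ne."""
    return 4 * Ne

def mean_time_to_loss(N, Ne):
    """Average generations to loss for alleles destined to be lost."""
    return (2 * Ne / N) * math.log(2 * N)

def expected_heterozygosity(Ne, mu):
    """Infinite-alleles equilibrium heterozygosity, theta/(1 + theta)."""
    theta = 4 * Ne * mu
    return theta / (1 + theta)

if __name__ == "__main__":
    mu = 1e-6  # hypothetical neutral mutation rate per generation
    for N in (100, 10_000, 1_000_000):  # hypothetical population sizes
        print(N,
              substitution_rate(N, mu),
              mean_time_to_fixation(N),
              round(mean_time_to_loss(N, N), 1),
              round(expected_heterozygosity(N, mu), 4))
```

Running the loop with different values of N makes the contrast in the list concrete: the substitution rate stays at μ whatever the population size, while the time to fixation and the expected heterozygosity both grow with N.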
ESTIMATING RATES OF MOLECULAR SEQUENCE DIVERGENCE

Rates of Amino Acid Replacement

The initial impetus for the neutral theory came from observations on the rate of amino acid replacements in proteins. When extrapolated to the entire genome, the inferred rate of evolution was several nucleotide substitutions per year. This rate was regarded as much too high to result from natural selection, because the intensity of selection must be limited by the total amount of differential survival and reproduction that occurs in the organism. Direct DNA sequencing later revealed that rates of nucleotide substitution vary according to the function (or presumed absence of function) of the nucleotides.

The type of data that must be analyzed is best illustrated by example. The first 18 amino acids present at the amino terminal end of the human and mouse γ-interferon proteins constitute a signal peptide that is used in secretion of the molecules (Gray and Goeddel 1983). The sequences are:

Human  Met Lys Tyr Thr Ser Tyr Ile Leu Ala Phe Gln Leu Cys Ile Val Leu Gly Ser
Mouse  Met Asn Ala Thr His Cys Ile Leu Ala Leu Gln Leu Phe Leu Met Ala Val Ser

In order to calculate the proportion of amino acids that differ in the two signal sequences, we can simply count the number of sites that are the same and the number of sites that differ. Among the 18 amino acids there are 10 differences, so the proportion different is 10/18 = 0.56. To interpret these data, let us suppose that amino acid replacements occur at the rate $\lambda$ per unit time. Consider two independently evolving sequences, initially identical, which at time $t$ are found to differ in the proportion $D_t$ of their amino acids.
After the next time interval, the proportion of differences $D_{t+1}$ is given by

$$D_{t+1} = (1 - D_t)(2\lambda) + D_t \tag{8.1}$$

In this equation, $(1 - D_t)(2\lambda)$ is the proportion of sites, previously identical, in which one or the other underwent an amino acid replacement during the time interval in question, which must be added to the already existing differences $D_t$ in order to give the total. (The equation ignores the unlikely possibility of an amino acid replacement making two previously different amino acid sites identical.) The factor of 2 is present because the total time for evolution is $2t$ units ($t$ units in each lineage after the split), which is illustrated in Figure 8.3. Equation 8.1 suggests the differential equation

$$\frac{dD}{dt} = D_{t+1} - D_t = 2\lambda - 2\lambda D_t \tag{8.2}$$

which has the solution

$$D_t = 1 - e^{-2\lambda t} \tag{8.3}$$

Figure 8.3 Two amino acid or nucleotide sequences that have each undergone independent evolution from a common ancestor for $t$ time units are separated by a total time of $2t$ units because there are $t$ units in each lineage after the split. The proportion of sites that differ in the sequence is denoted $D$ and the total number of sites $L$. In this particular example, $L = 10$ and $D = 3/10$.

An alternative argument can be used to derive Equation 8.3 without resorting to differential equations. If $\lambda$ is the rate of amino acid replacement per unit time, then the probability that a particular site remains unsubstituted for $t$ consecutive intervals along each of two independent lineages is $(1 - \lambda)^{2t}$, which is approximately equal to $e^{-2\lambda t}$, provided that $\lambda t$ is not too large. Thus, the probability $D_t$ of one or more replacements occurring in $t$ units of time after divergence is approximately $1 - e^{-2\lambda t}$, which is Equation 8.3.
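As a quick numerical check of the step from Equation 8.1 to Equation 8.3, the sketch below iterates the recurrence and compares the result with the closed form. The value of λ and the number of time steps are hypothetical, chosen only so that λ per step is small; the function names are likewise illustrative.

```python
import math

def divergence_by_recurrence(lam, t):
    """Iterate D_{t+1} = (1 - D_t)(2*lam) + D_t (Equation 8.1), starting from D_0 = 0."""
    D = 0.0
    for _ in range(t):
        D = (1.0 - D) * (2.0 * lam) + D
    return D

def divergence_closed_form(lam, t):
    """Closed-form solution D_t = 1 - exp(-2*lam*t) (Equation 8.3)."""
    return 1.0 - math.exp(-2.0 * lam * t)

if __name__ == "__main__":
    lam = 1e-3  # hypothetical replacement rate per site per time step
    for t in (10, 100, 500, 2000):
        print(t,
              round(divergence_by_recurrence(lam, t), 4),
              round(divergence_closed_form(lam, t), 4))
```

The two columns agree closely when λ is small, and both level off toward 1 as t grows, which is the saturation behavior that motivates the correction for multiple substitutions.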
Since $\lambda$ is the rate of amino acid replacement per unit time, the expected proportion of differences between two sequences at any time $t$ is

$$K = 2\lambda t \tag{8.4}$$

where the factor of 2 is again present because the total time for evolution is $2t$ units (Figure 8.3). Substituting $K$ from (8.4) into (8.3) and rearranging yields the following estimate of $K$:

$$\hat{K} = -\ln(1 - D) \tag{8.5}$$

where $D$ is the observed proportion of sites in which two sequences differ. If the sequences under comparison are $L$ amino acids in length, then the variance of $\hat{K}$ implied by the substitution process is approximately

$$\mathrm{Var}(\hat{K}) = \frac{D}{(1 - D)L} \tag{8.6}$$

The rate of evolution at the molecular level is given by the amount of sequence divergence that occurs per unit of time. Thus, as suggested by Equations 8.4 and 8.5, if two sequences are compared, and these are known to have diverged from a common ancestral sequence an estimated $t$ time units ago, then the rate of evolution $\lambda$ may be estimated as

$$\hat{\lambda} = \frac{\hat{K}}{2t} \tag{8.7}$$

The units of $\lambda$ are usually expressed as replacements per amino acid site (or substitutions per nucleotide site) per year.

The quantity $K$ is used in preference to $D$ in estimating the rate of molecular evolution because $K$ takes multiple substitutions into account. Over long periods of evolutionary time, the amino acid present at a particular site may be replaced several times, first by one alternative, then by another, then still another, and perhaps, at some stage, even return to the amino acid originally present at the site. When comparing two sequences, only the sites that are different can be identified. Sites that are identical at the present time may include some that were different in the past, and sites that are different at the present time might have undergone more than one substitution. The quantity $D$ is determined only by the proportion of differences between the sequences observed at the present time.
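Equations 8.5 through 8.7 are easy to apply directly. The following Python sketch is one minimal way to do so; the function names are illustrative, and the divergence time passed to the rate function is whatever estimate the user supplies. The example uses the signal-peptide counts given earlier (10 differences among 18 sites).

```python
import math

def poisson_corrected_distance(n_diff, L):
    """Return (D, K, Var(K)) from the number of differing sites and sequence length.

    K = -ln(1 - D)            (Equation 8.5)
    Var(K) = D / ((1 - D) L)  (Equation 8.6)
    """
    D = n_diff / L
    if D >= 1.0:
        raise ValueError("every site differs; K is undefined")
    K = -math.log(1.0 - D)
    var_K = D / ((1.0 - D) * L)
    return D, K, var_K

def rate_per_site_per_year(K, t_years):
    """Rate of evolution, lambda = K / (2t) (Equation 8.7)."""
    return K / (2.0 * t_years)

if __name__ == "__main__":
    D, K, var_K = poisson_corrected_distance(10, 18)
    print(f"D = {D:.3f}, K = {K:.3f}, SD(K) = {math.sqrt(var_K):.3f}")
```

With unrounded D = 10/18 the estimate is K of about 0.81; carrying the rounded value D = 0.56 through the same formula gives a slightly larger figure, so small differences in the second decimal place reflect rounding rather than a different method.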
The estimate $K$ makes a correction for multiple substitutions, but at the cost of introducing assumptions that the substitutions occur independently and at the same rate through time. For relatively short intervals of evolutionary time, during which multiple substitutions remain uncommon, the correction is minor, and the value of $K$ is close to that of $D$. This can be seen from the fact that the initial slope of the curve plotted in Figure 8.4 is 1. As the observed sequence divergence increases, it becomes more likely that multiple hits have happened, so the slope decreases. Over longer intervals, when many multiple substitutions have occurred, the correction is important, and the assumptions on which it is based must be evaluated critically. Correction for multiple substitution events is even more important for nucleotides than it is for amino acids. With amino acids, the probability of a random replacement returning an amino acid site to its original identity is 1/19 (assuming equal frequencies), whereas for nucleotides it is 1/3.

Figure 8.4 (x axis: substitutions per site, K.) As sequences become more divergent over time, the number of substitutions per site (K) can continue to increase, but the proportion of sites that mismatch in the observed sequences (D) saturates.

PROBLEM 8.2 Use the data in the preceding example to estimate the average rate of amino acid replacement in the signal peptide of γ-interferon during the divergence of mice and humans. Based on fossil evidence, the separation of these species occurred approximately 80 million years ago.

ANSWER For the signal peptide, $D = 0.56$ and $K = -\ln(1 - 0.56) = 0.82$. The estimated rate of evolution is therefore $0.82/[2 \times (80 \times 10^{6})] = 5.1 \times 10^{-9}$ amino acid replacements per amino acid site per year. The standard deviation of $K$ is estimated as $[0.56/(0.44 \times 18)]^{1/2} = 0.27$. With such a small sample size, the estimates could ordinarily not be taken too literally.
However, in this case, the average rate for the signal sequence is very close to the average rate for the molecule as a whole. For γ-interferon, among 155 amino acid sites there are 91 differences, giving $K = 0.88 \pm 0.22$ and an average rate of $5.5 \times 10^{-9}$ amino acid replacements per amino acid site per year.

Rates of amino acid replacement vary over a 500-fold range in different proteins. The rate of amino acid replacement in γ-interferon is one of the fastest rates known (Li et al. 1985). Among the slowest rates is that of histone H4, for which $\lambda = 0.01 \times 10^{-9}$ per year. The average rate among a large number of proteins is very close to the rate found in hemoglobin, which is approximately $1 \times 10^{-9}$ amino acid replacements per amino acid site per year.

To be concrete about the interpretation of the rate of amino acid replacement, consider a protein exactly 100 amino acids in length, in which the rate of amino acid replacement per amino acid site equals $1.0 \times 10^{-9}$ per year. For the entire protein, the rate of replacement equals $100 \times 1.0 \times 10^{-9} = 1 \times 10^{-7}$ per year. In two different species, therefore, the protein would accumulate amino acid differences at the rate of one replacement every 5 million years since their divergence from a common ancestor [because $(5 \times 10^{6}) \times 2 \times (1 \times 10^{-7}) = 1.0$].

The simple model that we just examined makes an assumption that is violated by an abundance of data. We assumed that all amino acid replacements occur with equal likelihood. Besides the fact that real proteins violate this assumption, we might not have expected it to be true, since some amino acid changes require a single underlying nucleotide change, while others require two or even three changes. More sophisticated models for amino acid sequence evolution account for these differences by weighting amino acid changes with their observed rates of change (Dayhoff 1972; Jones et al. 1992).
Rates of Nucleotide Substitution

Nucleotide sequences are analyzed in the same manner as amino acid sequences, but the equation analogous to (8.1) is slightly more complicated because it has to correct for cases in which a substitution makes two previously different nucleotide sites identical. The correction is significant for nucleotide sequences because an expected one third of random substitutions will make two previously different nucleotides identical. The correction is usually unnecessary for proteins because only 1/19 of random replacements make two previously different amino acids identical.

Several models of nucleotide substitution have been studied, which differ primarily in the assumptions about rates of mutation between pairs of nucleotides. The simplest model is one in which mutation occurs at a constant rate, and each nucleotide is equally likely to mutate to any other (Jukes and Cantor 1969). If $\alpha$ is the rate of mutating from one nucleotide to a different nucleotide, then in any time interval, A mutates to C with probability $\alpha$, A mutates to T with probability $\alpha$, and A mutates to G with probability $\alpha$. The probability that A does not mutate in this interval is therefore $1 - 3\alpha$. The probability that a particular site is A at time $t + 1$ is

$$P_{A(t+1)} = (1 - 3\alpha)P_{A(t)} + \alpha(1 - P_{A(t)}) \tag{8.8}$$

because the first part of the equation gives the probability of having been A at time $t$ and not mutating, and the second part is the probability of being any other nucleotide and mutating to A. From 8.8 it follows that

$$P_{A(t+1)} - P_{A(t)} = \frac{dP_{A(t)}}{dt} = -4\alpha P_{A(t)} + \alpha \tag{8.9}$$

Solving this differential equation,

$$P_{A(t)} = \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t} \tag{8.10}$$

assuming that the initial state was A. This is the transition probability from A to A, which we can write as $P_{AA}$.
If we observe two sequences that have been separated for time $t$, then the probability that they continue to carry the same nucleotide at a particular site is

$$P_{AA} = \tfrac{1}{4} + \tfrac{3}{4} e^{-8\alpha t} \tag{8.11}$$

because $2t$ is the total duration of time along both lineages during which changes could occur. Let $d$ be the proportion of nucleotide sites that differ between two sequences:

$$d = 1 - P_{AA} \tag{8.12}$$

$$d = \tfrac{3}{4}\left(1 - e^{-8\alpha t}\right) \tag{8.13}$$

In the previous symbols, $\lambda$ is the rate of mutation to a nucleotide different from the current nucleotide, so relating this to $\alpha$, we have $\lambda = 3\alpha$. This implies that $k = 2\lambda t = 2(3\alpha t) = 6\alpha t$. Taking logarithms of both sides of Equation 8.13, we deduce

$$8\alpha t = -\ln(1 - 4d/3) \tag{8.14}$$

and, since $k = \tfrac{3}{4}(8\alpha t)$,

$$k = -\tfrac{3}{4}\ln(1 - 4d/3) \tag{8.15}$$

where $k$ is the expected proportion of nucleotide sites that differ between two sequences at a time $t$ units after their evolutionary separation. By analogy with protein evolution, $d$ is the observed proportion of $L$ nucleotide sites in which the sequences differ. The variance $\mathrm{Var}(\hat{k})$ of the estimate can be estimated as

$$\mathrm{Var}(\hat{k}) = \frac{d(1 - d)}{L(1 - 4d/3)^{2}} \tag{8.16}$$

Figure 8.5 shows the relationship between time and $d$, and shows that nucleotide sequences that follow the Jukes-Cantor pattern of mutation (all nucleotides equally interchangeable) approach an asymptote, showing a divergence of 3/4. This makes intuitive sense because, after sufficient time, the common ancestry of the sequences has been erased, and 1/4 of the sites will match by chance.

Figure 8.5 Simulations of the substitution process for nucleotide sequences show that the sequence divergence saturates at $d = 0.75$. The jagged lines are numerical simulations of a sequence of length 1000, and the dots give the prediction under the Jukes-Cantor model.

PROBLEM 8.3 The coding region of the trpA genes in strains of the related enteric bacteria Escherichia coli strain K12 and Salmonella typhimurium strain LT2 were sequenced and compared (Nichols and Yanofsky 1979).
The trpA gene codes for one of the subunits of the enzyme tryptophan synthetase used in the synthesis of tryptophan. Estimate the amount of nucleotide divergence $k$ and amino acid divergence $K$ and their standard deviations.

K12  GTC GCA CCT ATC TTC ATC TGC CCG CCA AAT GCC GAT GAC GAC CTC CTG CGC CAG ATA GCC
     Val Ala Pro Ile Phe Ile Cys Pro Pro Asn Ala Asp Asp Asp Leu Leu Arg Gln Ile Ala
LT2  ATC GCC CCC ATC TTC ATC TGC CCC CCA AAT GCC GAT GAC GAT CTT CTG CGC CAG GTC GCA
     Ile Ala Pro Ile Phe Ile Cys Pro Pro Asn Ala Asp Asp Asp Leu Leu Arg Gln Val Ala

ANSWER For the amino acid sequences, $L = 20$ and $D = 2/20 = 0.10$; thus $K = -\ln(0.90) = 0.105$ with standard deviation 0.074. For the nucleotide sequences, $L = 60$ and $d = 9/60 = 0.15$; thus $k = -\tfrac{3}{4}\ln(0.8) = 0.167$ with standard deviation 0.058. Assuming that Escherichia and Salmonella diverged at around the time of the mammalian radiation 80 million years ago, the rates of evolution are $0.167/(2 \times 80 \times 10^{6}) = 1.04 \times 10^{-9}$ nucleotide substitutions per nucleotide site per year and $0.105/(2 \times 80 \times 10^{6}) = 0.66 \times 10^{-9}$ amino acid replacements per amino acid site per year. In the gene as a whole, the values are $k = 0.300$ for nucleotide substitutions and $K = 0.162$ for amino acid replacements.

The Jukes-Cantor model assumes that all possible nucleotide changes occur at an equal rate. In fact, it is generally observed from sequence comparisons that transitions, or changes either from purine to purine (A ↔ G) or from pyrimidine to pyrimidine (C ↔ T), are more frequent than transversions (the other possible changes). Kimura (1980a) sought to accommodate this observation by making a model with two mutation-rate parameters. Transitions occur with rate $\alpha$ and transversions occur with rate $\beta$.
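Before turning to the two-parameter rate matrix, the Jukes-Cantor calculation used in Problem 8.3 can be sketched in a few lines of Python. This is a minimal illustration of Equations 8.15 and 8.16 under the model's own assumptions; the function name is hypothetical, and the inputs are the counts quoted in the answer (9 differences among 60 nucleotide sites).

```python
import math

def jukes_cantor_distance(n_diff, L):
    """Return (d, k, Var(k)) under the Jukes-Cantor model.

    k = -(3/4) ln(1 - 4d/3)                  (Equation 8.15)
    Var(k) = d(1 - d) / (L (1 - 4d/3)^2)     (Equation 8.16)
    """
    d = n_diff / L
    if d >= 0.75:
        # At or beyond the saturation level of 3/4 the logarithm is undefined.
        raise ValueError("observed divergence is at or above the 3/4 asymptote")
    x = 1.0 - 4.0 * d / 3.0
    k = -0.75 * math.log(x)
    var_k = d * (1.0 - d) / (L * x * x)
    return d, k, var_k

if __name__ == "__main__":
    d, k, var_k = jukes_cantor_distance(9, 60)
    print(f"d = {d:.2f}, k = {k:.3f}, SD(k) = {math.sqrt(var_k):.3f}")
```

The explicit check against 0.75 is a design choice rather than part of the model: it makes the saturation limit visible to the caller instead of letting the logarithm fail on a non-positive argument.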
The rate matrix below shows the parameters of the Kimura two-parameter model; as you might guess, other models can also be specified by adding parameters to this table. These models can be fitted to the data in a variety of ways, including solutions of the sort we derived for the Jukes-Cantor model, as well as with more complex numerical methods.

Rate matrix for the Kimura two-parameter model:

                   Ending base
Starting base    A    C    G    T
      A          -    β    α    β
      C          β    -    β    α
      G          α    β    -    β
      T          β    α    β    -

Usually we have data on more than two sequences, and estimates of the parameters of the substitution models are sought. If the phylogeny of the organisms in the data set is known, it is possible to calculate the likelihood of the observed sequences given the phylogeny and the parameters in the model (Felsenstein 1981). Many advances have been made in recent years in applying the method of maximum likelihood for estimating parameters of the substitution process in this context (Goldman 1993; Yang 1996a).

Other Measures of Molecular Divergence

Rates of evolution and divergence times can be estimated from other kinds of molecular data if care is taken to consider how the process of mutation results in differences in the data that are actually scored. For example, Randomly Amplified Polymorphic DNA (RAPD), which is analyzed by the polymerase chain reaction, can be used to estimate nucleotide divergence only if one can verify some questionable assumptions about how PCR reactions work (Clark and Lanigan 1993). VNTR loci (loci that are polymorphic due to variable numbers of tandem repeats generated by unequal exchanges) have a forward and back mutation pattern that results in very different population dynamics. In a similar manner, microsatellites, also known as STRPs (short tandem repeat polymorphisms), undergo increases and decreases in copy number such that small changes in copy number are more common than large changes.
The result is a sort of stepwise mutation process, and models of this have yielded predictions about patterns of microsatellite variability that are roughly concordant with observations (Zhivotovsky and Feldman 1995).

THE MOLECULAR CLOCK

Although the rate of nucleotide substitution and amino acid replacement varies among different genes, the average rate of molecular evolution can be rather uniform throughout long periods of evolutionary time. Such uniformity in the rate of amino acid replacement or nucleotide substitution, first noted by Zuckerkandl and Pauling (1962), is known as a molecular clock.

An example of the approximate uniformity in amino acid substitutions is illustrated in the evolution of the α-globin gene in the organisms depicted in the phylogenetic tree in Figure 8.6. The data are summarized in Table 8.1. The numbers above the diagonal are the percent amino acid differences ($D \times 100$) between the α-globin sequences. For example, the α-globin genes of dog and human differ in 16.3% of their amino acid sites; since mammalian α-globin contains 141 amino acids, this percentage corresponds to 23 sites in which the amino acids differ. The percentages exclude differences that result from the insertion or deletion of amino acids, which are called gaps in sequence comparisons. For example, the comparison between human and shark α-globin is based on 139 amino acid sites that are homologous, and excludes gaps amounting to 11 additional amino acid sites. Missing from Figure 8.6 are plants, which (remarkably) also have sequences, known as leghemoglobin, that show significant homology to vertebrate globins (Landsmann et al. 1986).

Beneath the diagonal in Table 8.1 are the estimated proportions of differences per amino acid site, calculated from Equation 8.5 as $K = -\ln(1 - D)$.
The table also gives the average value of $K$ in all comparisons with the shark, carp, newt, chicken, echidna, kangaroo, and dog, respectively, and the divergence times from the bifurcations in Figure 8.6.

The average proportion of differences per site is plotted against divergence time in Figure 8.7. The very close fit to a straight line is evident. Since the divergence time is exactly half of the total time available for evolution (Figure 8.6), the rate of evolution $\lambda$ can be estimated as one-half times the slope of the line in Figure 8.7. For these data, the slope is $1.8 \times 10^{-9}$, and therefore $\lambda = 0.9 \times 10^{-9}$ amino acid replacements per amino acid site per year. The good fit of the points to the straight line indicates that the actual rate of α-globin evolution has deviated little from the average for the past 450 million years.

Figure 8.6 Phylogenetic relationships among eight vertebrate species (shark, carp, newt, chicken, echidna, kangaroo, dog, and human) and their approximate times of evolutionary divergence. (From Kimura 1983.)

TABLE 8.1 RATE OF EVOLUTION IN THE α-GLOBIN GENE

          Shark  Carp   Newt   Chicken  Echidna  Kang   Dog    Human
Shark       -    59.4   61.4   59.7     60.4     55.4   56.8   53.2
Carp      0.90     -    53.2   51.4     53.6     50.7   47.9   48.6
Newt      0.95   0.76     -    44.7     50.4     47.5   46.1   44.0
Chicken   0.91   0.72   0.59     -      34.0     29.1   31.2   24.8
Echidna   0.93   0.77   0.70   0.42       -      34.8   29.8   26.2
Kang      0.81   0.71   0.64   0.34     0.43       -    23.4   19.1
Dog       0.84   0.65   0.62   0.37     0.35     0.27     -    16.3
Human     0.76   0.67   0.58   0.28     0.30     0.21   0.18     -

Avg K     0.87   0.71   0.63   0.35     0.36     0.24   0.18
Time      450    410    360    290      225      135    80

(Percentage data from Kimura 1983.)
Note: Values above the diagonal are the observed percent amino acid differences ($D \times 100$) between the α-globin sequences in the species; values below the diagonal are the expected amino acid differences per site [$K = -\ln(1 - D)$].
Average values of $K$ and the estimated times of divergence (in millions of years) are given at the bottom of the table. Abbreviation: Kang, kangaroo.

Figure 8.7 (x axis: time, in millions of years.) Relation between the estimated number of amino acid substitutions in α-globin ($K$) between pairs of the vertebrate species in Figure 8.6, plotted against time since each pair diverged from a common ancestor. The straight line is expected based on a uniform rate of amino acid substitution during the entire period. (From Kimura 1983.)

PROBLEM 8.4 The β-globin molecule in primates contains 146 amino acids, and estimates of the number of amino acid differences among various primates are tabulated below (data from Kimura 1983). Calculate the average rate of evolution of the β-globin molecule in primates. (Hint: First calculate $D$ and $K$ for each species pair, then plot the points with time on the x axis and $K$ on the y axis. Finally, do a linear regression to estimate the average rate of substitution.)

Time of divergence      Average number of
(millions of years)     amino acid differences
        85                     25.5
        60                     24.0
        42                      6.25
        40                      6.0
        30                      2.5
        15                      1.0

ANSWER $D$ values are obtained by dividing each number of amino acid differences by 146, and average values of $K$ are estimated as $-\ln(1 - D)$. The average $K$ values, from top to bottom, are 0.192, 0.180, 0.044, 0.042, 0.018, and 0.007, respectively. These are the y values in the linear regression, and the x values are the divergence times. Altogether there are $n = 6$ points. In this case, $\sum(xy) = 3.1263 \times 10^{7}$, $\sum(x) = 2.72 \times 10^{8}$, $\sum(y) = 0.482$, and $\sum(x^{2}) = 1.5314 \times 10^{16}$. The slope of the regression is $3.15 \times 10^{-9}$, and the rate of evolution is half of this, or $1.58 \times 10^{-9}$ amino acid replacements per amino acid site per year. This estimate is reasonably close to the value of $0.9 \times 10^{-9}$ per year calculated for α-globin.
(Note: Rather than calculate $K$ from the average number of amino acid differences, it would be more accurate to calculate $K$ for each species comparison and then take the average; however, in this example, it makes very little difference.)

Variation across Genes in the Rate of the Molecular Clock

If an organism has a particular rate of mutation in its genome, one might think at first that the rate at which the molecular clock runs would be the same for all genes. But the neutral theory predicts that the rate of molecular evolution should depend on the neutral mutation rate, which may be quite a bit lower than the overall mutation rate, and may vary widely across genes. Figure 8.8 shows that three different proteins in the same organisms have widely differing molecular clock rates. Nevertheless, within each gene, we observe reasonably uniform rates of change. The variation across genes appears to be due to the fact that some proteins are highly tolerant of substitutions, whereas others suffer deleterious effects from even one or a few minor changes. Genes whose function is well buffered from the environment generally have a slower rate of substitution than genes whose products have a premium on variability. The extremes are represented by histone H4, at the low end, and γ-interferon, at the high end, with globin proteins near the middle of the spectrum.

Figure 8.8 (x axis: millions of years since divergence.) The molecular clock runs at different rates in different proteins. One reason is that the neutral substitution rate differs among proteins. Fibrinogen appears to be relatively unconstrained and has a high neutral substitution rate, while cytochrome c has a lower neutral substitution rate, and may be more constrained. Data are from a wide variety of organisms. (From Dickerson 1971.)

In short, the molecular clocks for different genes "tick" at different rates.
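The clock rate of any one gene is estimated as in Problem 8.4: convert differences to $K$, regress $K$ on divergence time, and halve the slope. The Python sketch below reproduces that calculation with an ordinary least-squares slope; the function and variable names are illustrative, and the data are the β-globin values tabulated in the problem.

```python
import math

# (divergence time in millions of years, average number of amino acid differences)
BETA_GLOBIN = [(85, 25.5), (60, 24.0), (42, 6.25), (40, 6.0), (30, 2.5), (15, 1.0)]
N_SITES = 146  # length of primate beta-globin, as given in Problem 8.4

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares regression of y on x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

def clock_rate(data, n_sites):
    """Return (slope, rate): slope of K against time in years, and rate = slope / 2."""
    xs = [t_myr * 1e6 for t_myr, _ in data]
    ys = [-math.log(1.0 - n_diff / n_sites) for _, n_diff in data]
    slope = least_squares_slope(xs, ys)
    return slope, slope / 2.0

if __name__ == "__main__":
    slope, rate = clock_rate(BETA_GLOBIN, N_SITES)
    print(f"slope = {slope:.3e} per year, rate = {rate:.3e} per site per year")
```

Because the sketch computes $K$ from unrounded $D$ values, its sums differ slightly in the later decimal places from hand-calculated figures, but the slope and rate agree with the answer to Problem 8.4 to the precision quoted there. Repeating the calculation for a different protein would give a different slope, which is the sense in which each gene has its own clock.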
In addition to functional constraints affecting substitution rate, the pattern of hereditary transmission also affects substitution rate. Organelle genomes are replicated and transmitted in a manner distinct from nuclear genes, so it may not be surprising that they undergo substitutions with different dynamics. Mitochondrial DNA exhibits wide variation in substitution rates across its relatively tiny genome, but in animals the substitution rate is generally much higher than the substitution rate of chromosomal genes. In plants, on the other hand, comparisons among nucleotide substitution rates of nuclear DNA, chloroplasts, and mitochondria reveal clear differences, with mtDNA showing less than one-third the substitution rate of chloroplast DNA, which in turn has about half the substitution rate of nuclear genes (Wolfe et al. 1987). In general, genes on the X chromosome have a lower rate of substitution than do genes on autosomes (Miyata et al. 1987). A higher rate of mutation in males (Shimmin et al. 1993) would lower the X-chromosome rate because the X chromosome spends more time in females. But the substitution rate for Y-linked genes is indistinguishable from that of autosomes (McVean and Hurst 1997), which suggests that the mutation rate is equal in both sexes but lower in X-linked genes than in autosomal genes.

Not only do substitution rates vary from one gene to another, but they also vary widely across sites within each gene! If all sites did undergo substitution at the same rate, then the number of substitutions per site should have a Poisson distribution. Fitch and Margoliash (1967) noticed that the cytochrome c data did not fit this model unless invariant and hypervariable sites were excluded. The models that we have developed so far assume that all sites evolve in the same way, so to accommodate this variability (and to test for how different the rates are) models that specifically incorporate rate variation must be developed.
One convenient model is to assume that the rates vary according to a gamma distribution (Golding 1983; Wakeley 1993). Yang (1996b) reviews estimates of the rate-variation parameter of the gamma distribution, and finds that all 17 cases examined show significant among-site variation in substitution rate.

Variation across Lineages in Clock Rate

The neutral theory predicts that the molecular clock should run at different rates for different organisms having different neutral mutation rates. The range of mutation rates is impressive. Figure 8.9 shows the number of nucleotide differences observed in the influenza NS genes, plotted against the year of isolation of the virus containing them. The rate of gene substitution averages $\lambda = (1.94 \pm 0.09) \times 10^{-3}$ nucleotide substitutions per nucleotide site per year. Although the rate of gene substitution is about $10^{6}$-fold faster than observed in germline genes in eukaryotes, it is nevertheless approximately constant during the period available for study. The extraordinary rate of evolution in influenza virus is thought to be related to a high rate of spontaneous mutation resulting from errors in replication (Holland et al. 1982). As in many other RNA-based viruses, the RNA replicase enzyme that replicates the influenza genome lacks a proofreading function.

Figure 8.9 Molecular evolution in the NS genes of influenza virus determined from strains isolated and stored during the past 60 years. The total rate of evolution in the 890-nucleotide sequence averages 1.73 ± 0.08 nucleotide substitutions per year, and the rate is remarkably uniform. (Horizontal axis: year of isolation. From Buonagurio et al. 1986.)

Rapid rates of gene substitution can be of immense medical significance. Yokoyama et al. (1988) estimated the rate of substitution in the pol gene of the human immunodeficiency virus as $0.5 \times 10^{-3}$ per nucleotide site per year.
The time of divergence between HIV1 and HIV2 was estimated at just 200 years ago, and the bulk of the genetic variability among recently isolated strains of HIV1 has been generated in the last 20 years.

The rate of the molecular clock also varies among taxonomic groups (Britten 1986). For example, the insulin gene evolved much more rapidly in the evolutionary line leading to the guinea pig than in other evolutionary lines (King and Jukes 1969), and the C-type viral sequences integrated into the primate genome evolved at twice the rate in Asian primates as in African primates (Benveniste 1985). Figure 8.10 illustrates another example of a retardation in the clock in one lineage. Such departures from constancy of the clock rate pose a problem in using molecular divergence to date the times of existence of most recent common ancestors. Before this inference can be justified, one needs to know that the set of species one is examining have a uniform clock.

Figure 8.10 Gene genealogy of Drosophila Adh sequences showing a significant slow-down of substitutions in the pseudoobscura clade. (After Takezaki et al.)

PROBLEM 8.5 The simplest way to test whether substitutions have occurred at the same rate in different organisms is to consider a tree like that in Figure 8.11. We expect that the divergence between A and C should be the same as the divergence between B and C if the clock is uniform on all branches. Tests of this hypothesis are known as relative rate tests. Any site that underwent a substitution along the branch from X to C (but not on the other branches) will have the property that A = B ≠ C. Sites that underwent a substitution on the branch from X to B (but not the other branches) will show A = C ≠ B.
Tajima (1993) showed that a simple and robust relative rate test could be performed by doing a chi-square test of the null hypothesis that the numbers of these two kinds of sites are equal. Suppose we observe sequences as follows:

A: ATG CTA GCA TGC ATG CTA GC

B: ATC CTA GCA TCC ATG GTA GT

C: ATG CTA TCA TGC TTG GTA GC

Calculate the observed and expected numbers of sites in the two categories (A = B ≠ C and A = C ≠ B), and calculate the chi-square statistic to determine whether they are equal.

Figure 8.11 A simple tree for illustrating the relative rate test of Tajima (1993).

ANSWER The observed number of sites for which A = B ≠ C is 2, and for A = C ≠ B there are 3 sites. Sites where A = B = C or A ≠ B ≠ C are ignored in this test. The expected number of sites of each of the two types is (2 + 3)/2 = 2.5, so the chi-square test gives $(2 - 2.5)^2/2.5 + (3 - 2.5)^2/2.5 = 0.2$, which is clearly not significant. This example had insufficient data for an adequate test, but it provides an example in which there is no evidence for significant difference in rates. A more flexible but more involved test, based on maximum likelihood, can be found in Muse and Weir (1992).

The Generation-Time Effect

One observed feature of molecular evolutionary clocks is that their rate is approximately constant in a time scale measured in years. This is quite unexpected because mutation rates are thought to be more nearly constant when measured in generations. However, the appropriate time scale of molecular evolution is not completely settled (Easteal 1985), as there is some evidence that the rate of synonymous substitution in genes in the rodent lineage (short generation time) might be about two times as rapid as occurs in the same genes in the human lineage (Wu and Li 1985; Li and Wu 1987).
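Returning briefly to the relative rate test of Problem 8.5, the counting and the chi-square arithmetic can be retraced in a few lines. This is a minimal sketch that follows the two site categories named in the problem; the function names `site_patterns` and `chi_square_equal` are illustrative choices, not part of any published software.

```python
def site_patterns(a, b, c):
    """Count sites at which exactly one of three aligned sequences differs."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"A=B!=C": 0, "A=C!=B": 0, "B=C!=A": 0}
    for x, y, z in zip(a, b, c):
        if x == y and y != z:
            counts["A=B!=C"] += 1
        elif x == z and z != y:
            counts["A=C!=B"] += 1
        elif y == z and z != x:
            counts["B=C!=A"] += 1
    return counts


def chi_square_equal(n1, n2):
    """Chi-square statistic (1 degree of freedom) for H0: both counts are equal."""
    expected = (n1 + n2) / 2
    return (n1 - expected) ** 2 / expected + (n2 - expected) ** 2 / expected


# Sequences from Problem 8.5, with the spaces between triplets removed.
seq_a = "ATG CTA GCA TGC ATG CTA GC".replace(" ", "")
seq_b = "ATC CTA GCA TCC ATG GTA GT".replace(" ", "")
seq_c = "ATG CTA TCA TGC TTG GTA GC".replace(" ", "")

counts = site_patterns(seq_a, seq_b, seq_c)
statistic = chi_square_equal(counts["A=B!=C"], counts["A=C!=B"])
print(counts, statistic)
```

The statistic is compared with the chi-square distribution with one degree of freedom, for which the 5% critical value is 3.84.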
Evidence from immunoglobulin genes also suggests that among mammals, the primate lineage has the slowest rate of nucleotide substitution (Sakoyama et al. 1987).

Even if true, a nearly constant rate of gene substitution per year is not necessarily in conflict with a constant rate of neutral mutation per generation. The reason is that organisms with short generation times tend to be small and to maintain large population sizes. In such organisms, the proportion of nearly neutral mutations will be reduced because effective neutrality requires that $Ns \ll 1$, where s is the selection coefficient against the mutation. However, the smaller proportion of nearly neutral mutations in these organisms is offset by the occurrence of more mutations per unit time than in larger organisms, because the generation time is shorter. Thus, the effects of short generation time and larger population size act in opposite directions and tend to cancel out (Crow 1985).

Does the Constancy of Substitution Rates Prove the Neutral Theory?

The possibility that gene substitutions might occur at an approximately constant rate gave some credence to the simplest version of the neutral theory. Theoretical principle 2, discussed earlier in this chapter, states that the expected rate of substitution of neutral alleles equals the rate of mutation u to neutral alleles. Therefore, on the face of it, the occurrence of molecular clocks would seem to support the neutral theory. But when we dig a bit deeper into the predictions of the molecular clock, we find that things are not necessarily so simple. In a theoretically perfect molecular clock driven by a random process identical to that of radioactive decay (a Poisson process), the variance in the rate of ticking would be equal to the average rate of ticking. Tests based on the number of substitutions between pairs of species in three proteins showed that the variance was significantly larger than the mean (Ohta and Kimura 1971).
Langley and Fitch (1974) backed this up by an analysis in which they estimated the number of substitutions on each branch of the phylogenetic tree, and compared the mean and variance of these counts for each branch. Again, there was a highly significant excess variance. Gillespie (1989) examined the ratio R of the variance to the mean number of substitutions in a set of four nuclear and five mitochondrial genes in mammals, and found that R ranged from 0.16 to 35.55. (The value of 35.55 is for cytochrome oxidase II, which shows 65 amino acid differences between human and mouse, 61 differences between human and cow, and only 21 differences between mouse and cow.) Gillespie argued that the large range of R implies a sixfold difference among mammalian lineages in rates of nucleotide substitution. This excess variance in substitution rate has been called an "episodic clock," characterized by periods of stasis alternating with periods of rapid substitution.

Why does the clock appear to be episodic? One possible reason is that the substitution process is not really a simple Poisson process. If instead the rate itself changes in a random or stochastic manner, the data could be fitted much better. Such a process, where the substitution rate for a Poisson process is itself stochastic, is called a doubly stochastic process, and it does indeed seem to fit the data better (Gillespie 1991). Such a compound Poisson process ought to show clusters of rapid change separated by periods of relative quiescence, a pattern that is generally supported by the data (Gingerich 1986; Gillespie 1989, 1991). One means of causing variation in the substitution rate is natural selection in a stochastically varying environment, and such models can also fit the data satisfactorily (Gillespie 1986).
Takahata (1987) has argued that the variance can be inflated by a "fluctuating neutral space" model, in which changes in selective constraints among lineages result in variation in substitution rate among lineages. The dynamics of substitutions are sufficiently complicated that a wide range of models can fit the data, but for now, one thing we are sure of is that the simplest Poisson process is not adequate.

PATTERNS OF NUCLEOTIDE AND AMINO ACID SUBSTITUTION

We have now seen several examples illustrating the general principle that nucleotide substitutions occur at a greater rate than amino acid replacements. The difference in rates, sometimes much greater than in these data, results from redundancy in the genetic code. As illustrated in Table 8.2, the codons for eight amino acids contain N (standing for any nucleotide) in their third position, seven terminate in Y (any pyrimidine, which means T or C), and five terminate in R (any purine, which means A or G). Coding sites containing an N are called fourfold degenerate sites because any of the four nucleotides will do, and those containing a Y or R are twofold degenerate sites (Li et al. 1985). Because of degeneracies, nucleotides in a gene can change without affecting the amino acid sequence. These changes are called synonymous or silent nucleotide substitutions. Nucleotide substitutions that do change amino acids are nonsynonymous substitutions.

Calculating Synonymous and Nonsynonymous Substitution Rates

In calculations involving synonymous and nonsynonymous nucleotide sites, the total number of synonymous sites is calculated as the number of fourfold degenerate sites plus one-third of the number of twofold degenerate sites.
The total number of nonsynonymous sites in a coding region is defined as the number of nondegenerate sites (nucleotides in which any change results in an amino acid substitution), plus two-thirds of the number of twofold degenerate sites (the latter because, with random mutation at twofold degenerate sites, two-thirds of the mutations are expected to result in amino acid changes). These conventions are illustrated in Problem 8.6.

TABLE 8.2 DEGENERACY IN THE GENETIC CODE

| First nucleotide | Second nucleotide T | Second nucleotide C | Second nucleotide A | Second nucleotide G |
|---|---|---|---|---|
| T | TTY Phe; TTR Leu | TCN Ser | TAY Tyr; TAR Stop | TGY Cys; TGA Stop; TGG Trp |
| C | CTN Leu | CCN Pro | CAY His; CAR Gln | CGN Arg |
| A | ATH Ile; ATG Met | ACN Thr | AAY Asn; AAR Lys | AGY Ser; AGR Arg |
| G | GTN Val | GCN Ala | GAY Asp; GAR Glu | GGN Gly |

Note: In this representation of the standard genetic code, the symbol N stands for any nucleotide (T, C, A, or G), the symbol Y for any pyrimidine (T or C), and the symbol R for any purine (A or G). The H in the set of codons for isoleucine (Ile) stands for "not-G" (T, C, or A). Degeneracies are as follows: N represents a fourfold degenerate site; Y and R represent twofold degenerate sites. The H in the set of codons for isoleucine is considered as twofold degenerate, as are the first nucleotides in four leucine codons (TTA, TTG, CTA, and CTG) and four arginine codons (CGA, CGG, AGA, and AGG). All other nucleotides are nondegenerate.

PROBLEM 8.6 For the sequences of the region of the trpA gene given earlier, calculate the synonymous and nonsynonymous substitution rates. Start by using Table 8.2 to assign degeneracy classes to each site. For each difference between E. coli and Salmonella, the difference is synonymous either if the site is fourfold degenerate or if it is twofold degenerate and the change is a transition (that is, A to G or the reverse, or T to C or the reverse).
The difference is nonsynonymous either if the site is nondegenerate or if it is twofold degenerate and the change is a transversion (that is, A or G to T or C). Equation 8.15 is used to estimate the proportion of nonsynonymous nucleotide substitutions per nonsynonymous site and the proportion of synonymous substitutions per synonymous site. The degeneracy assignments are therefore as follows (degeneracy is given for the three positions of each E. coli K12 codon; 0 = nondegenerate, 2 = twofold, 4 = fourfold):

| Codon | Degeneracy | K12 | Amino acid | LT2 | Amino acid | Change |
|---|---|---|---|---|---|---|
| 1 | 004 | GTC | Val | ATC | Ile | N |
| 2 | 004 | GCA | Ala | GCC | Ala | S |
| 3 | 004 | CCT | Pro | CCG | Pro | S |
| 4 | 002 | ATC | Ile | ATC | Ile | |
| 5 | 002 | TTC | Phe | TTC | Phe | |
| 6 | 002 | ATC | Ile | ATC | Ile | |
| 7 | 002 | TGC | Cys | TGC | Cys | |
| 8 | 004 | CCC | Pro | CCC | Pro | |
| 9 | 004 | CCA | Pro | CCA | Pro | |
| 10 | 002 | AAT | Asn | AAT | Asn | |
| 11 | 004 | GCC | Ala | GCG | Ala | S |
| 12 | 002 | GAT | Asp | GAT | Asp | |
| 13 | 002 | GAC | Asp | GAC | Asp | |
| 14 | 002 | GAC | Asp | GAT | Asp | S |
| 15 | 204 | CTG | Leu | CTT | Leu | S |
| 16 | 204 | CTG | Leu | CTG | Leu | |
| 17 | 004 | CGC | Arg | CGC | Arg | |
| 18 | 002 | CAG | Gln | CAG | Gln | |
| 19 | 002 | ATA | Ile | GTC | Val | N, N |
| 20 | 004 | GCC | Ala | GCA | Ala | S |

ANSWER The entries in the Change column mark the differences between E. coli K12 and Salmonella LT2 and indicate which changes are nonsynonymous (N) and which are synonymous (S). Altogether there are 38 nondegenerate sites, 12 twofold degenerate sites, and 10 fourfold degenerate sites. The total number of nonsynonymous sites is 38 + (2/3)12 = 46, and the total number of synonymous sites is 10 + (1/3)12 = 14. There are three nonsynonymous changes (D = 3/46 = 0.065) and six synonymous changes (D = 6/14 = 0.429). Now we use Equation 8.15 to estimate the proportion of nonsynonymous nucleotide substitutions per nonsynonymous site and the proportion of synonymous substitutions per synonymous site. The number of nonsynonymous nucleotide substitutions per nonsynonymous site is k = 0.068, and the number of synonymous nucleotide substitutions per synonymous site is k = 0.635.

Estimates of synonymous and nonsynonymous substitution rates for a mammalian protein-encoding gene are plotted in Figure 8.12. A striking observation is that the synonymous rates are generally much greater than the rates of substitution at nonsynonymous sites.
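The arithmetic in the answer to Problem 8.6 can be retraced with a short script. This is a sketch under one stated assumption: Equation 8.15 is not reproduced in this section, so the code assumes it is the Jukes-Cantor correction $k = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}D)$, which does reproduce the values 0.068 and 0.635 quoted in the answer. The function names are illustrative.

```python
import math


def site_totals(nondegenerate, twofold, fourfold):
    """Return (synonymous, nonsynonymous) site totals.

    Twofold degenerate sites count one-third toward synonymous sites and
    two-thirds toward nonsynonymous sites, as described in the text.
    """
    synonymous = fourfold + twofold / 3
    nonsynonymous = nondegenerate + 2 * twofold / 3
    return synonymous, nonsynonymous


def corrected_distance(d):
    """Assumed form of Equation 8.15 (Jukes-Cantor correction)."""
    return -0.75 * math.log(1 - 4 * d / 3)


syn_sites, nonsyn_sites = site_totals(nondegenerate=38, twofold=12, fourfold=10)

d_nonsyn = 3 / nonsyn_sites  # three nonsynonymous changes
d_syn = 6 / syn_sites        # six synonymous changes

k_nonsyn = corrected_distance(d_nonsyn)
k_syn = corrected_distance(d_syn)

print(syn_sites, nonsyn_sites)
print(round(d_nonsyn, 3), round(k_nonsyn, 3))
print(round(d_syn, 3), round(k_syn, 3))
```

The correction matters little for the nonsynonymous sites, where few changes are observed, but raises the synonymous estimate substantially because multiple substitutions at the same site become likely as D grows.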
The synonymous and nonsynonymous rates are scaled so that, if all mutations were equally likely to go to fixation, the two rates would be equal. The depression in nonsynonymous substitution rate is interpreted as being caused by natural selection eliminating those changes that are deleterious. There also appears to be greater variability of nonsynonymous rates than there is in the synonymous rates, although even the latter vary by more than twofold. Figure 8.13 shows that the two rates are correlated, suggesting that either the mutation rates vary from gene to gene or that the constraints on nonsynonymous sites are somehow correlated with those on synonymous sites. We shall see how this correlation might arise at the end of this section.

Figure 8.12 Synonymous sites and nonsynonymous sites in β-globin undergo substitutions at different rates, but to a first approximation, both may appear to exhibit a clocklike substitution process. (Horizontal axis: divergence time. From Li et al. 1985a.)

One problem that may be apparent with the above method for counting synonymous and nonsynonymous sites is that the status of a particular site may change during evolution. The reason is that changes elsewhere in the codon may make a site that was formerly fourfold degenerate now become twofold degenerate. In fact, the way the sites are tallied depends on the order in which they are considered. Another way to calculate nonsynonymous and synonymous substitution rates is to consider each codon and count the number of changes that occurred. For codons that changed at a single site, the change is scored as synonymous if there was no alteration in the resulting amino acid sequence, and nonsynonymous if there was an alteration.

Figure 8.13 Plotting the data of Figure 8.12 in another way, the relative rates of synonymous and nonsynonymous substitutions vary somewhat, but in all cases nonsynonymous rates are lower. (Horizontal axis: synonymous rate. Data from Li et al. 1985a.)
When there are two differences in a codon, then it is necessary to consider both orders of occurrence, and if we have no reason to assume one order is more likely, then both are considered equally likely. The two orders may have differing numbers of synonymous and nonsynonymous changes. For example, if a codon changes from CCG (proline) to AGG (arginine), it could have done so either through CCG → ACG (threonine) → AGG or through CCG → CGG (arginine) → AGG. The first possibility entails two nonsynonymous changes, whereas the second entails only one. If there are three changes in a codon, there are six possible orders in which they might have occurred. This all-possibilities method, by Nei and Gojobori (1986), seems like an improvement, but actually the estimates come out to be very similar to the method in Problem 8.6. Furthermore, even this method does not avoid the problems of sites changing status due to flanking changes. Far more complicated models are needed to fully avoid this problem, but in the end, the estimates that they give are also very similar to the simplest method outlined in Problem 8.6 (Muse and Gaut 1994; Goldman and Yang 1994).

Paralleling the evolutionary rates for amino-acid-changing substitutions, the rates of nonsynonymous nucleotide substitution vary tremendously among different proteins. Among the slowest rates is that of histone H4, for which $k = 0.004 \times 10^{-9}$ substitutions per nonsynonymous nucleotide site per year, and among the fastest is that of γ-interferon, for which $k = 2.80 \times 10^{-9}$ substitutions per nonsynonymous nucleotide site per year. The average rate among a large number of proteins is very close to the rate found in hemoglobin, which is $0.87 \times 10^{-9}$ substitutions per nonsynonymous nucleotide site per year (Figure 8.14). As in the examples given here, rates of nonsynonymous nucleotide substitution are usually quite similar to the rates of amino acid replacement in the same genes.
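The pathway counting described above for codons that differ at two or three positions can be sketched as follows. This is a simplified illustration in the spirit of the all-possibilities method, not the published procedure: it enumerates every order of single-nucleotide changes and scores each step, but it does not treat pathways that pass through stop codons specially. The names `CODE` and `pathway_counts` are illustrative.

```python
from itertools import permutations

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def pathway_counts(start, end):
    """List (path, synonymous steps, nonsynonymous steps) for every order
    in which the differing positions could have changed."""
    differing = [i for i in range(3) if start[i] != end[i]]
    results = []
    for order in permutations(differing):
        codon, path = start, [start]
        syn = nonsyn = 0
        for pos in order:
            nxt = codon[:pos] + end[pos] + codon[pos + 1:]
            if CODE[nxt] == CODE[codon]:
                syn += 1
            else:
                nonsyn += 1
            codon = nxt
            path.append(codon)
        results.append((path, syn, nonsyn))
    return results


paths = pathway_counts("CCG", "AGG")
for path, syn, nonsyn in paths:
    print(" -> ".join(path), "synonymous:", syn, "nonsynonymous:", nonsyn)

# With both orders weighted equally, average the counts over pathways.
mean_syn = sum(s for _, s, _ in paths) / len(paths)
mean_nonsyn = sum(n for _, _, n in paths) / len(paths)
```

For the CCG to AGG example, the two pathways give two nonsynonymous steps in one order and one synonymous plus one nonsynonymous step in the other, so the equally weighted averages are 0.5 synonymous and 1.5 nonsynonymous changes.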
In contrast with the highly variable rates of nonsynonymous nucleotide substitutions among proteins, the rates of synonymous substitution are much more uniform. For example, in mammalian genes, the fastest rate of synonymous substitution is only 3 to 4 times greater than the slowest rate (see Figure 8.14). However, the average rate, $k = 4.7 \times 10^{-9}$ substitutions per synonymous site per year, is not only greater than the average rate of nonsynonymous substitutions, but it is greater than the fastest known rate of nonsynonymous substitutions (for γ-interferon).

Figure 8.14 Comparison of rates of synonymous and nonsynonymous nucleotide substitutions. Synonymous rates are generally much faster and much more uniform than nonsynonymous rates. (From Kimura 1986.)

The great variability among proteins in the rate of nonsynonymous nucleotide substitution, when contrasted with the much smaller variability found in the rate of synonymous substitutions, is illustrated graphically in Figure 8.14. This disparity has been used as evidence in favor of the neutral theory. Interpreted according to the neutral theory, the variation in rates occurs because there are selective constraints on amino acid substitutions that do not operate as strongly on synonymous nucleotide substitutions. Not just any amino acid will serve at a particular position in a protein molecule, because each amino acid must participate in the chemical interactions that fold the molecule into its three-dimensional shape and give the molecule its specificity and ability to function. The need for proper chemical interactions and folding constrains the acceptable amino acids that can occupy each site.
Although some amino acid replacements may be functionally equivalent or nearly equivalent, many more are expected to impair protein function to such an extent that they reduce the fitness of the organisms that contain them. Thus, the constraints on acceptable amino acids are selective constraints because unacceptable amino acid replacements are eliminated by selection.

If an amino acid replacement does occur, its effect on the function of the protein product will depend on many factors, but one of the most important determinants of protein conformation is the charge of the amino acid. Different amino acid replacements give different numbers of charge changes, and in most cases the smallest change in charge might be expected to result in the smallest conformational change. Peetz et al. (1986) examined the charge changes in the evolution of seven proteins, and found that hemoglobin α, hemoglobin β, myoglobin, and insulin all accumulated charge changes at a rate slower than expected by random substitution. This finding is consistent with constraints on the conformation of these proteins that limit permissible charge changes. On the other hand, cytochrome c and fibrinogens A and B accumulate charge changes at the expected neutral rate.

For comparison of rates, it would be useful to study rates of nucleotide substitution in stretches of DNA wholly devoid of function and therefore subject exclusively to the whims of mutation and random drift. A likely candidate is found in a class of genes called pseudogenes, which are DNA sequences that are homologous to known genes but that have undergone one or more mutations eliminating their ability to be expressed. Pseudogenes are thought to be completely nonfunctional relics of mutational inactivation, and, in fact, their extremely rapid rate of nucleotide substitution is offered in support of this view.
The average rate of nucleotide substitution in pseudogenes is faster than the average rate found in intervening sequences, flanking regions, and fourfold degenerate (synonymous) sites. Pseudogenes evolve at the fastest rates known, which may correspond to rates of substitution when DNA is completely unconstrained by natural selection. The fact that fourfold degenerate sites evolve more slowly than pseudogenes may be a suggestion that these sites are not totally lacking in constraint, an idea we shall return to shortly.

Rates of nucleotide substitution also vary within protein molecules. Human insulin is a good illustration. The A and B polypeptide chains found in the mature insulin molecule are created by post-translational cleavage of a longer polypeptide known as preproinsulin. Preproinsulin contains a signal peptide for secretion and an internal C-peptide, neither of which is present in the active molecule. The rates of nucleotide substitution in these three regions are 0.16 for the A and B chains, 0.99 for the C peptide, and 1.16 for the signal peptide. [As in Li et al. (1985), rates are expressed in terms of nonsynonymous nucleotide substitutions per nonsynonymous site per billion years.] In insulin, while there is a sevenfold difference between the maximum and minimum rates of nonsynonymous substitution in different regions of the molecule, the rates of synonymous substitution differ only twofold. Moreover, there is a negative correlation between functional importance and rate of nonsynonymous substitution within the insulin molecule. Many diverse amino acid sequences can serve as signal peptides provided they are hydrophobic, which suggests that selective constraints on signal peptides may be reduced in comparison with sequences in mature polypeptides. In insulin, as expected, the rate of nonsynonymous substitution is fastest in the signal peptide and slowest in the functional subunits of the mature molecule.
This kind of negative correlation between selective constraint and substitution rate has also been observed in several other proteins (Li et al. 1985).

Within-Species Polymorphism

So far we have talked only about differences between nucleotide sequences of genes from distinct species. DNA sequence differences between alternative alleles of the same gene in a single species may also be synonymous or nonsynonymous, and it is instructive to compare levels of within-species polymorphism at synonymous vs. nonsynonymous sites. In this case we do not generally talk about substitution rate, but rather quantify the variability with the nucleotide diversity. Nucleotide diversity, often symbolized with the Greek letter π, is the probability that a sample of a particular nucleotide site drawn from two individuals will differ. It is essentially the heterozygosity at the nucleotide level. Figure 8.15 illustrates the first systematic study of DNA sequence variation in a set of 11 alcohol dehydrogenase alleles of Drosophila melanogaster (Kreitman 1983). Of the 2659 nucleotides sequenced, 52 were variable across the 11 alleles. The nucleotide diversity over the entire gene was 0.0065 ± 0.0017, meaning that 99.4% of the time, pairs of alleles will match at a site.

Figure 8.15 Polymorphic nucleotide sites among 11 alleles of the Adh alcohol dehydrogenase gene of D. melanogaster. The first line gives a consensus sequence for Adh at sites that vary; subsequent lines give the nucleotides from each copy for the polymorphic sites. A dot indicates that the site is identical to the consensus sequence. The triangles indicate sites of insertion or deletion relative to the consensus sequence. The star in exon 4 indicates the site of the amino acid replacement (threonine-to-lysine) responsible for the Fast-Slow mobility difference in the Adh protein. (After Kreitman 1983.)

The level of nucleotide diversity differs in different regions of genes. Figure 8.16 illustrates the estimates of nucleotide diversity found in different parts of the Drosophila Adh gene. The different parts are the 5' (upstream) flanking region, the 5' transcribed but untranslated region, the coding region (nonsynonymous substitutions only, with both the slowest and fastest rates shown), intervening sequences, the 3' (downstream) transcribed but untranslated region, and the 3' untranscribed region. On the average, the fastest rates of substitution occur in intervening sequences and the 3' flanking regions, but the average rates in the 5' flanking regions and the 3' untranslated region are all substantially faster than $0.88 \times 10^{-9}$, which is the average rate of nonsynonymous substitution in coding regions (see Figure 8.14). Neutralists would argue that the high rates of substitution in noncoding regions and variation among different parts of the coding region result from varying degrees of selective constraints on different parts of the gene. It is to be emphasized that Figure 8.16 depicts the results for just one gene, and in individual instances, especially in comparisons of closely related species, there may be fewer substitutions observed in flanking sequences than in coding sequences, or fewer changes in synonymous sites than in nonsynonymous sites.

Figure 8.16 Nucleotide diversity in Adh of Drosophila melanogaster.
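The definition of nucleotide diversity given above, the probability that two sampled alleles differ at a site, translates directly into an average over all pairs of sequences. The sketch below is a minimal illustration; the function name `nucleotide_diversity` and the three short alleles are hypothetical and are not taken from the Adh data.

```python
from itertools import combinations


def nucleotide_diversity(alleles):
    """Average proportion of differing sites over all pairs of aligned alleles."""
    if len(alleles) < 2:
        raise ValueError("need at least two alleles")
    n_sites = len(alleles[0])
    if any(len(seq) != n_sites for seq in alleles):
        raise ValueError("alleles must be aligned to the same length")
    pairs = list(combinations(alleles, 2))
    total_differences = sum(
        sum(x != y for x, y in zip(first, second)) for first, second in pairs
    )
    return total_differences / (len(pairs) * n_sites)


# Hypothetical aligned alleles, ten sites each (illustration only).
sample = [
    "ACGTACGTAC",
    "ACGTACGTAT",
    "ACGAACGTAC",
]

pi = nucleotide_diversity(sample)
print(pi)
```

In this toy sample the three pairs differ at 1, 1, and 2 of the ten sites, so π is 4/30. Applied separately to synonymous and nonsynonymous sites, or to different regions of a gene, the same calculation gives the region-specific comparisons discussed in the text.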
Comparison of nucleotide diversity in different functional regions of a single gene can reveal features of the gene's evolutionary history. For example, in 11 sequenced Adh genes of Drosophila melanogaster (Kreitman 1983), among 14 substitutions that were observed in the coding region, 13 were silent substitutions. Considering the genetic code and the codon usage in the Adh gene, it is possible to calculate what portion of the substitutions would be silent if all substitutions occurred with equal frequency. This figure is about 30% in the case of the Adh gene in Drosophila, which implies that about 70% of the substitutions would be expected to cause amino acid replacements. Since only one out of 14 observed substitutions was an amino acid replacement, such substitutions are greatly underrepresented. This finding is consistent with the view that most amino acid replacements are eliminated from the population by purifying selection. The same logic can be extended to argue that sequences that are conserved are likely to be functionally important; this type of reasoning led to the identification of a new open reading frame in the HIV (AIDS) virus genome (Miller 1988).

The action of natural selection can sometimes be inferred from levels of synonymous and nonsynonymous polymorphism. For genes that determine surface antigens of pathogens or those that determine the major histocompatibility antigens of mammalian cells, the rates of nucleotide substitution can be quite high. One way to address whether the high rate of substitution is driven by selection is to examine the levels of synonymous and nonsynonymous diversity in these genes. For example, Hughes and Nei (1988) found that in the regions coding for the antigen recognition sites in the class I MHC (major histocompatibility complex) genes of humans and mice, the rate of nonsynonymous substitution exceeded the rate of synonymous substitution by a ratio of 3 : 1.
This ratio is the reverse of that found in the usual situation and in other regions in the same genes, where silent substitutions are present in excess. The excess of amino acid replacements is consistent with a model in which mutations that generate diversity are often advantageous, and hence natural selection accelerates the substitution process. Endo et al. (1996) developed software to scan the gene sequence databases for cases in which the nonsynonymous rate significantly exceeded the synonymous rate, and they recovered 17 cases. Nine of these 17 cases were cell surface antigens or immune system genes, proteins for which one can easily imagine scenarios in which high levels of diversity are advantageous. High rates of nonsynonymous substitution are also found in protein toxins called colicins that certain bacteria produce to kill potential competitors in their immediate vicinity (Riley 1993; Ayala et al. 1994).

Implications of Codon Bias

Synonymous substitutions occur at a greater rate than nonsynonymous substitutions, implying that they face weaker selective constraints. But are synonymous changes completely neutral, or do they too face some form of constraint? One potential type of constraint occurs through codon preferences, which are correlated with the relative abundance of tRNA molecules that interact with and translate the codons. In bacteria and yeast, for example, highly abundant proteins tend to use codons for abundant tRNA molecules, whereas proteins produced in small amounts tend toward codons for less abundant tRNA molecules (Ikemura 1985). A plot of the frequency of use of the synonymous codons that code for leucine shows that CUG is much more frequent than the others, corresponding to an increased abundance of this tRNA.
A second potential constraint on synonymous substitutions occurs through possible secondary structures that the RNA might form, in which certain nucleotides must undergo base pairing (see the next section for an elaboration). Pre-messenger RNA secondary structure may influence the speed or accuracy of intron splicing, rate of transport, or stability. A third potential constraint on synonymous substitutions is related to the fact that, during translation, the probability of misincorporation of the wrong amino acid increases if there is a pause while the translation machinery waits to find a rare tRNA. Such translation errors are known to occur (in fact, mistranslation of an mRNA that bears a frameshift mutation can yield an active protein). Pausing during translation may also be of importance to the folding of the protein into its proper three-dimensional structure.

If synonymous codons are neutral, then one would expect their frequencies of use to correspond to the product of the nucleotide frequencies. If all four bases were equally frequent, all synonymous codons should be used equally frequently. A more subtle way to test for departure from equal codon use is to count the incidence of polymorphisms and substitutions toward or away from the most abundant codon. If the most abundant codon became the most abundant by chance, then the substitutions toward and away from this codon should show no bias. But if the most abundant codon is "preferred," then there will be a deficit of substitutions away from this codon. Application of this kind of approach for codon bias in E. coli suggested an average selection coefficient against disfavored codons of about $s = 7.3 \times 10^{-9}$ (Hartl et al. 1994). Even Drosophila, whose effective size might be around $10^{6}$, appears to exhibit significant codon preference (Akashi 1995), suggesting that selective constraints on synonymous codons must be greater than $10^{-6}$ in this organism (Figure 8.17).
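As a rough numerical illustration of a test for departure from equal codon use, the sketch below computes a chi-square goodness-of-fit statistic for the six leucine codon counts shown in Figure 8.17. It assumes the simplest null hypothesis, that all six codons are used equally, which is not the expectation plotted in the figure. The function name and the pairing of counts with codons, as read from the figure, are assumptions made for illustration.

```python
# Observed counts of the six leucine codons in D. melanogaster (Figure 8.17).
# The codon-to-count pairing is read from the figure and is an assumption here.
observed = {"CTG": 4341, "TTG": 1566, "CTC": 1417,
            "CTT": 777, "CTA": 705, "TTA": 351}


def chi_square_equal_use(counts):
    """Chi-square statistic against the hypothesis of equal codon use."""
    total = sum(counts.values())
    expected = total / len(counts)  # same expected count for every codon
    return sum((obs - expected) ** 2 / expected for obs in counts.values())


# Chi-square critical value for 5 degrees of freedom at P = 0.05.
CRITICAL_5PCT_DF5 = 11.07

stat = chi_square_equal_use(observed)
print(f"chi-square = {stat:.1f} with {len(observed) - 1} degrees of freedom")
print("equal use rejected at the 5% level:", stat > CRITICAL_5PCT_DF5)
```

A statistic far above the critical value indicates only that codon use is unequal; it does not by itself distinguish selection from unequal base composition, which is why the text turns to the direction of substitutions.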
Figure 8.17 The frequency of the six codons that encode leucine in Drosophila melanogaster is not uniform. This kind of codon bias, in which one codon is present in excess, is commonly observed. [Bar chart of observed and expected counts for the codons CTG, TTG, CTC, CTT, CTA, and TTA; the observed counts are 4341, 1566, 1417, 777, 705, and 351.] (Data from FlyBase, http://cbbridges.harvard.edu:7081.)

From the selectionist viewpoint, while granting that substitutions in pseudogenes may be neutral, and synonymous substitutions may be constrained by natural selection only weakly, it is nevertheless maintained that nucleotide substitutions that change amino acid sequences are inevitably subject to the action of natural selection of an intensity that is sufficient to counteract the effects of random genetic drift. Thus, selectionists would argue that amino acid substitutions that have occurred in a protein during the course of evolution became fixed by natural selection because they increased the fitness of the carriers through improvement in function of the molecule. However, neutralists argue back, the selectionist viewpoint cannot easily explain the negative correlation between functional importance and rate of substitution within proteins. Furthermore, a neutralist might add, even a slightly detrimental mutation has some chance of being fixed unless a population is very large (Chapter 7).

POLYMORPHISM AND DIVERGENCE IN NUCLEOTIDE SEQUENCE DATA

The effects of varying the neutral mutation rate on levels of polymorphism within a species and the interspecific divergence in nucleotide sequences are plotted in Figure 8.18. The theory is consistent with the idea that genes with a high rate of nucleotide substitution, as indicated by a large number of interspecific sequence differences, should also have a high level of intraspecific polymorphism.
Polymorphism depends only on the product of the neutral mutation rate and the effective size, through the formula $H = \theta/(1 + \theta)$ that we encountered in Chapter 7. For strictly neutral genes, interspecific divergence does not depend on the population size, but instead follows the formula $k = 2\mu t$, where $\mu$ is the neutral mutation rate and $t$ is the time since the species diverged. If we compare two genes, the level of intraspecific polymorphism would let us estimate a $\theta$ value for each gene. Given the value for gene A and the observed interspecific divergence, the divergence time could be estimated. For gene B, we would also have a $\theta$ estimated from the level of polymorphism, and we could use the divergence time estimated from gene A to determine a predicted value of divergence in gene B.

Figure 8.18 Reasoning behind the HKA test. Consider two genes, A and B, that differ in neutral substitution rate. $\theta = 4N\mu$ can be estimated for each gene based on observed levels of nucleotide heterozygosity (top panel). Given the observed divergence between two species in gene A (determined by the neutral mutation rate and time), the divergence in gene B can be predicted based on its neutral substitution rate and the divergence time obtained from gene A. The HKA test is a goodness-of-fit test to the observed levels of intraspecific diversity and interspecific divergence under a model whose parameters are population sizes, neutral mutation rates, and times of divergence.

The above reasoning has been formalized in a popular test of neutrality based on nucleotide sequence data within and among species (Hudson et al. 1987). Sequences of at least two genes from a number of individuals of each of two species are needed to apply the test. Define $S_i^A$ and $S_i^B$ as the number of polymorphic nucleotide sites in gene $i$ in species A and B, respectively, and $d_i$ as the number of differences in gene $i$ between a pair of alleles sampled randomly, one from species A and one from species B.
The expected values of these parameters are obtained from the infinite-sites neutral model, assuming that the two species diverged $t$ generations ago, that the population sizes are $2N$ and $2Nf$, and that each gene has an associated $\theta_i = 4N\mu_i$. Estimates of $\theta_i$, $f$, and $t$ are obtained by a least-squares method that gives the best fit of the expressions for the expected values and variances of $S_i^A$, $S_i^B$, and $d_i$ to the data, and goodness-of-fit is tested with an appropriate chi-square test. Using data from the Adh coding and 5' flanking regions in D. melanogaster and D. sechellia, Hudson et al. (1987) found that the observed values deviated significantly from the neutral model in a direction consistent with the operation of balancing selection acting on the coding region of Adh. This finding is consistent with Kreitman's (1983) observation of an excess of silent substitutions in Adh, except that the test of Hudson et al. makes use of the genetic variation observed within and among species. The "HKA" test has seen many applications in molecular population genetics (Kreitman and Hudson 1991; Aguade et al. 1992; Begun and Aquadro 1993; Gaut and Clegg 1993).

PROBLEM 8.7 In a set of 12 Adh sequences in Drosophila melanogaster, McDonald and Kreitman (1991) observed 42 silent (synonymous) polymorphisms and two replacement (nonsynonymous) polymorphisms. As had been concluded by Kreitman (1983), this suggests that most replacement mutations are deleterious and are eliminated from the population. When they examined fixed differences between D. melanogaster and either D. simulans or D. yakuba, they found that seven of the fixed differences were replacements and seventeen were silent. What is the significance of this observation?

ANSWER A null hypothesis might be that the effects on fitness of a mutation would be the same whether within a species or at any time along the ancestral history of two species back to the common ancestor.
If this is true, then we would expect the ratio of silent to replacement polymorphisms to be the same as the ratio of silent to replacement fixed differences. A simple test of this is to do a 2 × 2 contingency chi-square:

              Fixed   Polymorphic
Replacement     7          2
Silent         17         42

For this table we get $\chi^2 = 8.20$, and with one degree of freedom, P < 0.01. (A correction is often applied to the chi-square for tables with counts less than 5, but it does not make much difference in this case.) The low probability means we reject the null hypothesis and conclude that, within species, there is a tendency to avoid replacement polymorphisms; however, between-species replacement differences are much more likely to occur. McDonald and Kreitman (1991) argue that this pattern is consistent with adaptive fixation of amino acid replacements, since they are relatively more frequent in interspecific comparisons, and such adaptive polymorphisms would be less common than neutral polymorphisms because adaptive differences would not remain polymorphic for as long a duration. This simple test is useful in assessing the relative importance of neutral drift versus selection in interspecific differences.

Impact of Local Recombination Rates

Recall from Chapter 5 that the level of polymorphism in Drosophila shows a striking correlation to the local rate of recombination. Regions of low recombination rate are nearly devoid of variation, whereas regions with high rates of recombination are highly polymorphic. The idea of comparing polymorphism and divergence makes this pattern even more striking and allows us to eliminate a possible cause. One possible reason for the correlation is that recombination itself is mutagenic, or that somehow the two processes are related mechanistically. (That is, perhaps when mutations occur, the DNA configuration is altered to increase the chance of recombination.)
If this were the case, then the regions of low recombination rate should also have a low mutation rate, and hence lower interspecific divergence. Figure 8.19 shows that a lower divergence is not observed. Levels of interspecific divergence are independent of local recombination rates. The conclusion is that the correlation between recombination rates and levels of polymorphism observed by Aquadro et al. (1994) must be due to more rapid elimination of the variation in regions of low recombination.

Figure 8.19 The striking correlation between local rates of recombination and levels of intraspecific nucleotide diversity cannot be explained by a lower mutation rate in regions of low recombination. If regions of low recombination had low rates of mutation, the interspecific divergence would be lower in these regions. That it is not is shown by these data. [Scatter plot; horizontal axis: coefficient of exchange.] (From Aquadro et al. 1994.)

Two known mechanisms that remove variation faster in regions of low recombination are selective sweeps and background selection. Background selection is thought to be the primary mechanism for the reduced variation (discussed in Chapter 5 in the section on linkage and recombination), but this does not mean that sweeps do not occur. Selective sweeps occur when a favorable mutation takes place, and selection rapidly increases its frequency. Such sweeps can have a dramatic effect on levels of variation in the selected gene and the region around it. The size of the "swept" region depends on the rate of recombination and is larger for regions of low recombination. This means that the chance that a particular site has been swept free of variation is greater in regions of low recombination, assuming the density of selective sweeps is uniform across the genome. An example of a selective sweep is an esterase B allele in the mosquito that is associated with pesticide resistance (Figure 8.20).
The resistant allele has apparently undergone a nearly global sweep, judging from the near monomorphism of the esterase B gene (Raymond et al. 1991; ffrench-Constant et al. 1991). We do not know how frequent such sweeps are, but one possible means of identifying them is to score many highly polymorphic markers in many populations and look for regions of reduced variation. Schlötterer et al. (1997) performed such a survey and found several cases of individual genes in single populations that were depauperate in variation, perhaps due to a local sweep event.

Figure 8.20 Restriction maps of the esterase B gene from global samples of the mosquito Culex pipiens. Note the identity of a haplotype from Egypt through Texas. This haplotype is associated with insecticide resistance, and probably underwent a global sweep in the face of strong selection. (From Raymond et al. 1991.)

GENE GENEALOGIES

There is an important distinction between the construction of trees from sequences of genes from different species and from sequences of alleles from a single species. The former yields a customary phylogenetic gene tree, while the latter produces what is called a gene genealogy. The relationships among species result from macroevolutionary processes, whereas allelic differences result from a number of microevolutionary processes, including aspects of genetic transmission. Once the nucleotide sequences of alleles are known, the different alleles can be treated like genes in different species in applying standard methods for inferring a phylogenetic tree. However, great care is needed in constructing gene genealogies, because recombination among the sequences results in a gross violation of the assumptions of most tree-building methods. Provided the rate of recombination is not too high, localized blocks of sequence can be identified in which there appears to have been no recombination in the ancestral history of the sampled alleles. With this caveat, gene genealogies can be of great use in inferring the evolutionary history of a polymorphism. For example, they can reveal which of a group of alleles is older, or which alleles are more closely related to each other. Figure 8.21 shows the gene genealogy from Kreitman's (1983) Adh sequence data, and the higher diversity of the S allele clearly makes it appear to be older.

Figure 8.21 A phylogenetic tree for 11 Adh alleles of Drosophila melanogaster based on 43 nucleotide differences. The scale is the number of nucleotide differences per site. Ja: Japan; Af: Africa; Wa: Seattle, Washington; Fl: southern Florida; Fr: France. S and F refer to the slow and fast electrophoretic forms. [Tip labels: Ja-F, Af-F, Wa-F, Fr-F, Fl-F, Ja-S, Fr-S, Af-S, Fl-2S, Fl-1S, Wa-S.] (Data from Kreitman 1983.)

Hypothesis Testing Using Trees

Beyond the descriptive approach to showing relationships among alleles, gene genealogies can be used to test fundamental forces of population genetics, including natural selection. For example, consider a phylogenetic tree based purely on neutral variation. As illustrated in Figure 8.22A, when the substitution rate is $\mu$, the expected time to coalescence to a common ancestor for a randomly chosen pair of alleles is $4N$ generations (Chapter 2).
Under a model like Ohta's (1973), where many mutations are slightly deleterious, the tree is not changed very much because the alleles included in a sample are the subset of mutations that occurred that were nearly neutral. On the other hand, in the case of adaptive mutations, the rate of fixation would be much faster than with neutrality, so that sites of adaptive mutation would have shorter coalescence times than flanking neutral sites (Figure 8.22B). Finally, with balancing selection (heterozygote advantage), polymorphisms would be maintained for a longer time than under the pure drift model (Figure 8.22C). The number of statistical methods for inference of population genetic forces from gene genealogies is increasing rapidly, and there is ample opportunity for exciting progress in this area.

Figure 8.22 Computer simulations of the infinite-allele model of molecular evolution, with generation on the horizontal axis. (A) No selection: with strict neutrality, the expected time from mutation to fixation of alleles that will go to fixation is $4N_e$ generations. (B) Purifying selection (in this case with half of the mutations having a fitness of 0.5) results in less polymorphism at any given time. (C) Stabilizing selection (overdominance or frequency dependence) can retain alleles in a polymorphic state for much longer times. Representative trees are plotted to the right of each panel.

PROBLEM 8.8 A study of variation in the gene encoding superoxide dismutase in Drosophila melanogaster (Hudson et al. 1994b) revealed 63 polymorphic sites in three slow alleles and 22 fast alleles (where fast and slow refer to the mobility of the protein product in an electrophoretic gel). An additional 16 slow alleles were separately scored, giving a total of 19 slow alleles that were found to be identical in nucleotide sequence.
The fast allele broke into 10 distinct haplotypes, and the most common was FastA with nine copies. The partial table of pairwise counts of numbers of sites that differ between alleles is:

         FastA  FastH  FastB  FastJ  FastK
Slow       1      3      4      9     16
FastA             2      3      8     15
FastH                    3     10     17
FastB                          11     18
FastJ                                  7

How would you address the question of whether this sample is typical of a sample from a neutral gene?

ANSWER The aspect of the pattern of variation that is unusual is that the fast alleles appear to be quite variable, whereas all 19 slow alleles are identical. A gene genealogy of the fast alleles would look like a typical neutral tree with roughly exponentially distributed branch lengths, but the complete tree would then have 19 identical slow alleles placed one substitution away from FastA. The suspicion is that the slow allele must have arisen recently and is being pulled to high frequency by selection. An observed increasing trend in the slow allele frequency supported this conjecture. To make a formal test out of this observation, Hudson et al. (1994b) used the coalescent procedure described in Chapter 7 to generate simulated data sets with a sample size of 25 and having 63 polymorphic sites. For each of the 10,000 simulated samples, they asked, how often is there a set of 12 alleles that differ by 0 or 1 substitutions? (The 9 FastA alleles and 3 slow alleles in the original observed sample differ at just 1 site.) The answer was 81 of the 10,000 cases, giving a probability of 0.0081. The observed sample is not a likely occurrence under neutrality. It is instructive to note that these data were consistent with neutrality by the Fu and Li (1993) test, the Tajima (1989) test, and the HKA test (Hudson et al. 1987), demonstrating that even strong departures from neutrality may be missed by these standard tests.
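The simulation test described in this answer can be sketched in a few lines. This is a minimal illustration and not the program used by Hudson et al. (1994b). It assumes a standard neutral coalescent without recombination, places a fixed number of mutations on the genealogy in proportion to branch length, and interprets "a set of 12 alleles that differ by 0 or 1 substitutions" as a subset of the sample that is polymorphic at no more than one site. All function names are hypothetical.

```python
import random
from collections import Counter


def simulate_haplotypes(n, n_sites, rng):
    """Simulate n haplotypes with n_sites segregating sites on a neutral coalescent tree."""
    lineages = [[frozenset([i]), 0.0] for i in range(n)]  # [descendants, branch length]
    branches = []                                          # finished (descendants, length)
    while len(lineages) > 1:
        k = len(lineages)
        wait = rng.expovariate(k * (k - 1) / 2.0)          # time to the next coalescence
        for lineage in lineages:
            lineage[1] += wait
        i, j = rng.sample(range(k), 2)                     # the pair that coalesces
        a, b = lineages[i], lineages[j]
        branches.extend([tuple(a), tuple(b)])
        lineages = [lin for idx, lin in enumerate(lineages) if idx not in (i, j)]
        lineages.append([a[0] | b[0], 0.0])
    # Drop each mutation on a branch chosen in proportion to its length.
    weights = [length for _, length in branches]
    hit = rng.choices(branches, weights=weights, k=n_sites)
    return [tuple(1 if s in desc else 0 for desc, _ in hit) for s in range(n)]


def largest_near_identical_subset(haplotypes):
    """Size of the largest subset of alleles polymorphic at no more than one site."""
    best = max(Counter(haplotypes).values())               # subsets with no differences
    for skip in range(len(haplotypes[0])):                 # let one site vary
        reduced = Counter(h[:skip] + h[skip + 1:] for h in haplotypes)
        best = max(best, max(reduced.values()))
    return best


def subset_test(n=25, n_sites=63, subset_size=12, replicates=10_000, seed=1):
    """Fraction of simulated samples containing a near-identical subset of the given size."""
    rng = random.Random(seed)
    hits = sum(
        largest_near_identical_subset(simulate_haplotypes(n, n_sites, rng)) >= subset_size
        for _ in range(replicates)
    )
    return hits / replicates


if __name__ == "__main__":
    print(subset_test(replicates=200))
```

With the full 10,000 replicates the returned fraction plays the role of the probability quoted in the answer, although the exact value depends on the modeling choices listed above.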
This problem illustrates a common principle in molecular population genetic analysis, which is that ad hoc approaches tailored to particular observations often are necessary.

The topology of gene trees affords an opportunity for yet another test of goodness of fit of data to the neutral theory. We saw in Chapter 7 that the coalescent approach provides a description for the expected topology of a gene tree under the infinite-sites model. In particular, the expected time back to the next preceding coalescent event is exponentially distributed with a mean proportional to $1/\binom{k}{2} = 2/[k(k-1)]$, where $k$ is the current number of distinct alleles. A test of Fu and Li (1993) makes use of the fact that the model predicts a relationship between $\theta$ and the number of "external mutations." An external mutation is a mutation that occurs on a branch of the gene genealogy that terminates in an observed allele (an external or terminal branch). The remarkable observation that Fu and Li made was that the expected number of external mutations is $\theta$, independent of sample size. The test is based on the idea that selection will affect the number of external branches more than it will affect internal branches, and Fu and Li devised test statistics for goodness of fit between observed and expected numbers of external mutations. The test has some advantages over the Tajima test, but Simonsen et al. (1995), after extensive simulations to test the power of various neutrality tests, conclude that the Tajima test (see Problem 7.10) is generally the most powerful against alternative hypotheses of selective sweeps, population bottlenecks, or population subdivision (but see Problem 8.8).

Inferences about Migration Based on Gene Trees

Data from a panmictic population obeying the infinite-sites model will have a characteristic gene tree topology.
If the population is divided into two semi-isolated groups, the alleles within each group will, on average, be more similar to one another than comparisons between groups. This would mean that a gene tree in such a subdivided population would have two major clades corresponding to the two populations. For higher levels of migration, the gene tree will be somewhere between these two extremes. Slatkin and Maddison (1989, 1991) devised means for estimating the number of migrants per generation, Nm, from the inferred gene genealogy. In essence, the approach uses parsimony to obtain a direct count of migration events compatible with the tree, and Nm is estimated from this count.

With sufficient DNA sequence data, one can be confident that identical alleles are truly identical by descent, an important aspect of inference of population history and migration. In their analysis of D. pseudoobscura Adh sequences, Schaeffer and Miller (1991) found that geographically distant populations had identical alleles, and more generally, that the gene tree did not partition geographically, as though the population were panmictic. This was an exciting result, given the extraordinary level of population subdivision in D. pseudoobscura third chromosome inversions. It implies that the latter subdivision is not just a historical accident, but is being maintained in the face of sufficient migration to homogenize other sorts of genetic variation.

The data of Bowcock et al. (1994) show yet another aspect of very high resolution molecular data. After constructing a tree based on 30 microsatellite loci, they observed that human samples showed a significant tendency to cluster by continent. Although lower-resolution methods had shown some degree of dissimilarity among groups of humans, this was the first study to show that reduced intercontinental migration was sufficient to partition human genetic variation.
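The parsimony step in the Slatkin and Maddison approach described in this section can be illustrated with a short sketch. Treating the sampling location of each allele as a character on the gene genealogy, Fitch's parsimony algorithm gives the minimum number of location changes (migration events) needed to explain the tree. The nested-tuple tree encoding, the function name, and the two example genealogies are assumptions made for illustration. Converting the count into an estimate of Nm requires the calibration developed by Slatkin and Maddison, which is not attempted here.

```python
def min_migration_events(tree):
    """Return (possible_locations, events) for a rooted binary tree by Fitch parsimony.

    A tree is either a string naming the population in which an allele was
    sampled (a tip) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_states, left_events = min_migration_events(tree[0])
    right_states, right_events = min_migration_events(tree[1])
    shared = left_states & right_states
    if shared:
        # The two subtrees can share an ancestral location: no new event.
        return shared, left_events + right_events
    # Otherwise at least one migration event is needed at this node.
    return left_states | right_states, left_events + right_events + 1


# Two hypothetical genealogies for alleles sampled from populations "A" and "B".
subdivided = ((("A", "A"), "A"), (("B", "B"), "B"))   # two clean clades
mixed = ((("A", "B"), "A"), (("B", "A"), "B"))        # locations interleaved

print(min_migration_events(subdivided)[1])   # 1
print(min_migration_events(mixed)[1])        # 3
```

The cleanly partitioned tree needs a single event, whereas the interleaved tree needs three, which matches the intuition that more migration scatters alleles from each population across the genealogy.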
MITOCHONDRIAL AND CHLOROPLAST DNA EVOLUTION

We already saw in Chapter 5 that mitochondrial DNA can be highly informative about the geographic structure of populations. Some of the advantages of using DNA sequence variation from this organelle genome include:

• The DNA molecule (in most animals) is relatively small and easy to isolate.
• It is present in multiple copies per cell; therefore, older and less well preserved samples are still likely to yield useful information.
• The mitochondrial genome does not undergo recombination, so it is more likely to show a clean branching structure to its gene trees.
• It evolves rapidly.

The primary problems with mtDNA are:

• The absence of recombination means that the gene tree constructed from any mitochondrial DNA gene will reflect just a single realization of the genealogical process. As such, the data will not be as informative about species or population trees as, say, a dozen nuclear genes.
• Much of the work with mtDNA has been on the control region sequence. While this region is highly variable, the variability occurs at a subset of sites that are so mutable that multiple substitutions often occur.

In animals, mitochondria are usually inherited through the egg cytoplasm (maternal inheritance) and are genetically uniform within an individual. The mitochondrial genome consists of a single circular DNA molecule, denoted mtDNA, the size of which varies over a remarkably narrow range in different species of vertebrates (15.7-19.5 kb), averaging about 16 kb. Human mtDNA is fairly typical, containing a control region for the initiation of DNA replication, genes for two ribosomal RNA molecules, 22 transfer RNA molecules, and 13 proteins. Twelve of the proteins are subunits of enzyme complexes that carry out electron transport and ATP synthesis.
The genetic code of mammalian mitochondria differs from the standard code in that ATA codes for Met, TGA codes for Trp, and AGR codes for End (termination of protein synthesis); thus, every codon in the mitochondrial code can be written as either NNY or NNR. Animal mitochondria also contain several hundred enzymes used in metabolic functions, but these are coded for by nuclear genes, and the enzymes are transported into the mitochondria.

Figure 8.23 Relationship between percent sequence divergence (100d) and divergence time. The points represent estimates from pairwise comparisons of restriction endonuclease cleavage maps. The initial rate of mtDNA sequence divergence is shown by the longer dashed line and the rate of divergence of single-copy nuclear DNA by the shorter dashed line. [Horizontal axis: divergence time (million years).] (From Brown et al. 1979.)

At the nucleotide level, the rates of substitution in mammalian mtDNA are typically 5 to 10 times greater than occur in single-copy nuclear genes, averaging approximately $10 \times 10^{-9}$ substitutions per nucleotide site per year. The reason for the high rate of substitution is thought to be either a high rate of nucleotide misincorporation or a low efficiency of repair by the DNA polymerase. Support for the latter view comes from the observation that, unlike the nuclear DNA polymerase, the mitochondrial DNA polymerase lacks the proofreading function. In protein-coding mitochondrial genes, the rate of synonymous substitution is about five times greater than the rate of nonsynonymous substitution, which is comparable with the ratio found in nuclear genes. Mitochondrial tRNA genes in mammals evolve approximately 100 times as rapidly as their nuclear counterparts (Brown 1985; Avise 1986).
One result of this faster rate of nucleotide substitution is that the divergence between two sequences saturates relatively soon, so that the linearity of divergence over time (the molecular clock) is an accurate approximation only for species that have diverged less than about 10 million years (Figure 8.23). Exceptions to the elevated rate of mtDNA divergence have been found, notably in Drosophila (Powell et al. 1986).

PROBLEM 8.9 The mitochondrial DNA of 21 humans of diverse geographic and racial origin were digested with 18 restriction enzymes, 11 of which exhibited one or more fragments in which size polymorphism occurred (Brown 1980). All restriction site polymorphisms could be explained by single-nucleotide differences; thus there was no evidence for insertions, deletions, or other mtDNA rearrangements. Altogether, 868 nucleotide sites were assayed for differences among individuals, and the average number of differences per nucleotide site per individual was estimated at 0.0018. Assuming that mammalian mtDNA undergoes sequence divergence at the rate of 5 to $10 \times 10^{-9}$ nucleotide substitutions per site per year, and that the rate is uniform in time, calculate the length of time since all of the 21 contemporary mtDNA molecules last shared a common ancestor. Calculate the effective size of the population from the level of mtDNA variability.

ANSWER Given an average number of differences per nucleotide site per individual of 0.0018 and an average rate of divergence of 5 to $10 \times 10^{-9}$ per site per year, the time of the most recent common ancestor would be between $0.0018/(10 \times 10^{-9})$ and $0.0018/(5 \times 10^{-9})$, or 180,000 to 360,000 years. Assuming a generation time of 20 years, this means that all mtDNA in the diverse sample could have been from a single female in the population between 9,000 and 18,000 generations ago.
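The arithmetic in this part of the answer can be checked with a few lines of code. The variable names are illustrative, and the 20-year generation time is the assumption stated in the answer.

```python
diversity = 0.0018                  # differences per nucleotide site (Brown 1980)
rate_low, rate_high = 5e-9, 10e-9   # sequence divergence per site per year
generation_time = 20                # years per generation, as assumed in the answer

years_min = diversity / rate_high   # the faster rate gives the more recent date
years_max = diversity / rate_low

print(round(years_min), round(years_max))                      # 180000 360000
print(round(years_min / generation_time),
      round(years_max / generation_time))                      # 9000 18000
```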
To estimate the long-term effective size of the population, recall that the expected time to fixation of a newly arisen neutral mutation is $4N_e$ generations. This result applies to an autosomal gene in a diploid species. For mitochondrial genes, only females transmit them, and they are effectively haploid, so the corresponding fixation time for mtDNA is just $N_e$ generations. If we argue that the one mtDNA type went to fixation in 9,000 to 18,000 generations, this is equivalent to saying that the long-term population size has been $N_e$ = 9,000 to 18,000. This sounds like a low number, but modern anthropologists find it reasonable, given the population structure of ancient humans and the rapid, nearly starburst-like growth since the adoption of agricultural methods.

One of the most dramatic claims in the history of population genetics was that human genetic variation in mtDNA indicates a recent African origin of modern humans (Cann et al. 1987). This claim was based on restriction site variation among mtDNA of 147 humans in five populations. The 12 restriction enzymes sampled an average of 370 restriction sites per individual, equivalent to assaying 9% of the mtDNA genome per individual. A total of 195 polymorphic sites were found in the genome, and the precise location on the mtDNA sequence of all polymorphic sites was identified. When the 133 distinct mtDNA haplotypes were assembled into a phylogenetic tree, a clade was found in which the most ancient branch pointed to a group of people of African ancestry (Figure 8.24).
Given the observed number of differences between the two most divergent mtDNA types, and assuming there is 2 to 4% divergence in mtDNA sequences per million years (estimated from the human-chimp split at 5 MYA), the common ancestor to all of the observed haplotypes was estimated to have existed 140,000 to 280,000 years ago. Sequences of the control region, which diverge at a rate of 12 to 15% per million years, produce a date for the common ancestor of 166,000 to 249,000 years ago (Vigilant et al. 1991). Several other data sets have been collected to address the issue of date, and all have produced estimates of the date of the common ancestor of human mtDNA of between 100,000 and 400,000 years ago (Hasegawa and Horai 1991; Pesole et al. 1992; Ruvolo et al. 1993).

Figure 8.24 Parsimony tree of mtDNA variation from the original "mitochondrial Eve" paper. Much was made of the observation that there is an isolated clade consisting only of Africans. [Scale: sequence divergence (%).] (From Cann et al. 1987.)

These figures, and their interpretation, have launched a controversy centering on (1) the best way to infer the time of the common ancestor, (2) the meaning of higher African diversity, (3) the confidence in an African root, (4) the neutrality of human mtDNA variation, and (5) the implications for human evolution. Whether modern humans migrated out of Africa in the past 200,000 years may not be supported with statistical rigor by mtDNA alone (Templeton 1993), but when haplotypes of nuclear genes (Tishkoff et al. 1996) or many nuclear genes are considered in addition, the case for African origin is strong (Nei and Roychoudhury 1993). We must be careful to realize, however, that the fact that Africa has the greatest genetic diversity today does not by itself guarantee that modern humans originated in Africa.
If the African population has had a long-term effective size much larger than other populations, or if the other populations suffered a bottleneck that Africa did not, then Africa would be more diverse no matter where humans originated (Relethford 1995). In addition, just because a gene genealogy appears to have a root that coincides with an African allele does not mean that modern humans came from an expansion of the African population to cover the earth; it only means that the one gene has the observed ancestral history. Other genes may trace back to other origins.

Inferences about human origins from extant patterns of genetic variation require an understanding of nonequilibrium models, where populations grow in size, new colonies are founded, and populations remain connected by some level of migration. Recently there has been much attention paid to the influence of past changes in population size on patterns of variation. It was observed that a growing population produces a gene genealogy that has a more starlike shape than does a stationary population, and this in turn produces a peak in the distribution of pairwise counts of mismatches (Slatkin and Hudson 1991; Rogers and Harpending 1992). The use of patterns of human genetic variation to make inferences about our ancestral history is an active and lively area of inquiry.

Chloroplast DNA and Organelle Transmission in Plants

Chloroplasts are cellular organelles that also have their own genome and also are transmitted in a non-Mendelian fashion. Chloroplast DNA (cpDNA) ranges in size from 135 to 160 kb, and it occurs in multiple copies in each chloroplast. Its structural organization is conserved in higher plants, and the rate of synonymous nucleotide substitution is approximately $1 \times 10^{-9}$ substitutions per site per year.
Thus, the evolution of cpDNA is conservative in regard to both sequence and structure (Table 8.3). The opposite extreme, with a very fast rate of evolution, is found in the mtDNA of fungi, which changes rapidly in both sequence and structure. The mtDNA of angiosperm plants has a pattern of evolution opposite to that found in animal mtDNA. In sequence evolution the rate in angiosperms is slow, but in structural evolution it is fast.

TABLE 8.3 RATES OF SEQUENCE AND STRUCTURAL EVOLUTION IN ORGANELLE DNA

| Genome | Rate of nucleotide substitution | Rate of structural evolution |
|---|---|---|
| Angiosperm cpDNA | Slow | Slow |
| Angiosperm mtDNA | Slow | Rapid |
| Mammalian mtDNA | Rapid | Slow |
| Fungal mtDNA | Rapid | Rapid |

In plants, the mtDNA genome is large and highly complex. In some instances, a single molecule can resolve itself into smaller circles and even linear molecules. For example, in the turnip (Brassica campestris), a 218 kb molecule undergoes an internal recombination event that produces smaller circles of 135 kb and 83 kb. Maize mtDNA contains six pairs of repeated sequences that can undergo recombination and create a variety of structural derivatives. The Arabidopsis mtDNA genome was recently sequenced, and although it is 366 kb, nearly all the increase in size compared to mammalian mtDNA is noncoding (Unseld et al. 1996). Many plant mitochondria also contain autonomously replicating plasmid DNA molecules, and mtDNA is also capable of incorporating segments of cpDNA. Why plant mtDNA genomes are so large, complex, and variable in size is not understood.

Maintenance of Variation in Organelle Genomes

Organelle genomes have unusual population genetics because of their (typically) uniparental transmission and because many copies are passed from the mother to the progeny through the egg.
Uniparental transmission has important implications in the operation of natural selection, since it is equivalent to a haploid clonal population structure, and pure selection models can maintain polymorphism in such populations only if the fitnesses are frequency dependent. From the outset, then, uniparental transmission makes it less likely for polymorphisms to be maintained by natural selection, even if epistatic effects with the nuclear genome are allowed (Clark 1984). The widespread polymorphisms observed in mtDNA must then be attributed largely to high mutation rates, just as the rapid substitution rate was attributed to a high mutation rate. Polymorphisms can also be maintained by interspecific hybridization, and it is possible to obtain estimates of rates and directions of interspecific matings from nuclear and mtDNA data (Asmussen et al. 1987). Unusual forms of transmission, such as the doubly uniparental transmission of the mussel Mytilus edulis, result in separate male and female lineages, which are highly divergent (Skibinski et al. 1994; Stewart et al. 1995).

The theory of random genetic drift for organelles is more complex than that for nuclear genes because individual cells have many organelles that are apportioned among daughter cells; thus there is an additional level of sampling when heteroplasmic cells divide. Models of the dual sampling process have been examined in some detail (Birky et al. 1983; Takahata 1983, 1984). These models predict some level of heteroplasmy, and although early empirical studies did not detect heteroplasmy, it has now been described in crickets (Harrison et al. 1987), Drosophila (Hale and Singh 1986; Solignac et al. 1983, 1984, 1987), lizards (Densmore et al. 1985), mice (Boursot et al. 1987), cattle (Hauswirth and Laipis 1982), frogs (Monnerot et al. 1984), treefrogs, and bowfin fish (Bermingham et al. 1986).
Heteroplasmy can be maintained by a steady-state balance between the forces of random genetic drift and mutation, but heteroplasmy is most frequently observed in restriction length polymorphisms, in which variants differ in the number of copies of a small repeat. Simple deterministic models show that heteroplasmy can be stably maintained by infrequent paternal transmission (leakage), by natural selection, or by bi-directional mutation, such as the gain/loss events one would expect for changes in copy number of a small repeat (Clark 1988). Distributions of heteroplasmy in the field cricket are consistent with a model of mutation-selection balance, with smaller genomes favored by selection (Rand and Harrison 1986).

Evidence for Selection in mtDNA

There are several clear examples of nonneutrality of mtDNA mutations. For example, many forms of cytoplasmic male sterility are caused by defects in mtDNA (Grun 1976; Levings 1983). Similarly, cytoplasmically transmitted drug resistance genes have been shown to be associated with the mitochondrial genome of yeast. The potential importance of mtDNA variation in human health was revealed in the implication of mitochondrial DNA defects in the muscle diseases known as mitochondrial myopathies. The celebrated bicycle racer Greg LeMond, a three-time winner of the Tour de France, was forced into early retirement by a defect in mitochondrial oxidative metabolism. Effects of natural selection also have left their mark on extant patterns of mtDNA sequence variation, as revealed by the discordance between levels of polymorphism and divergence in synonymous vs. nonsynonymous sites (Ballard and Kreitman 1994; Rand and Kann 1996). The strongly skewed distribution of frequencies of segregating sites also suggests that human mtDNA has faced selection pressure (Hey 1997).
If a cytoplasmically related factor of any sort is associated with a particular mtDNA type, then the mtDNA will "hitchhike" along with the other cytoplasmic factor. A striking example of this mode of evolution in action was caught by Turelli et al. (1992), when they noticed that a cytoplasmically transmitted Wolbachia infection in Drosophila simulans was rapidly spreading north in California, and as it did so, it propelled a single mtDNA type to high frequency. While the mtDNA genome may seem small, its uniparental transmission makes it susceptible to any cytoplasmic factor that may carry a particular cytoplasmic type to fixation. However, most populations have fairly high levels of mtDNA variation, suggesting that such sweep events are not very common.

MOLECULAR PHYLOGENETICS

The use of techniques of molecular biology, particularly those for determining amino acid or nucleotide sequences, has added a new dimension to phylogenetic inference. For example, the analysis of 5S RNA sequences in a broad variety of microorganisms has led to a reclassification at the deepest of phylogenetic levels, resulting in a new kingdom, the Archaea (Woese 1981). In addition to the satisfaction of understanding the history of relationships among living things, the application of comparative molecular analysis to infer robust and accurate phylogenetic relationships has spawned interest in the application of those phylogenetic trees for testing hypotheses about evolutionary mechanisms.

The problem of inferring the correct branching topology for a tree that relates a set of organisms is a challenge in part because of the enormous number of possible bifurcating trees. If there are $n$ species to be placed, there are $(2n-3)!/[2^{n-2}(n-2)!]$ rooted trees that describe possible ancestral histories. For five species this number is 105, and for 10 species it is 34,459,425.
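The growth in the number of rooted bifurcating trees can be checked directly from the formula $(2n-3)!/[2^{n-2}(n-2)!]$. The short Python sketch below is an illustration only; the function name `num_rooted_trees` is hypothetical and not part of the text.

```python
from math import factorial


def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n species.

    Implements (2n - 3)! / (2^(n-2) * (n - 2)!), valid for n >= 2.
    The function name is an illustrative choice, not from the text.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    # The quotient is always a whole number, so integer division is exact.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


if __name__ == "__main__":
    for n in (3, 5, 10, 20):
        print(n, num_rooted_trees(n))
```

Running it for $n = 5$ and $n = 10$ gives the two values quoted in the text, and the count for 20 species already exceeds $10^{21}$.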
For many data sets of 30 or more species, the number of possible trees is so enormous that it is not possible to examine all topologies and assess the fit of the data to each tree, even with the very fastest computers. Fortunately, the trees are not all independent of one another, and the key to many of the algorithms that try to find the best fitting tree is to eliminate whole classes of trees based on the observed data. Let us consider a few of these tree-building methods.

Algorithms for Phylogenetic Tree Reconstruction

If a gene in a pair of species or populations evolves in clocklike fashion, and if the degree of divergence between two genes implies that they have been diverging for $t$ generations, then we can infer that the genes separated from a common ancestor $t/2$ generations ago. This reasoning provides a group of methods of tree construction based on measures of genetic distance. One such method is the unweighted pair-group method with arithmetic mean (UPGMA) or average distance method. This method requires that all sequences evolve at the same rate, an assumption that other methods can relax to some degree, but the ease of understanding UPGMA still gives it heuristic appeal. With a matrix of all pairwise distances, a tree is built up by first grouping the two species with the smallest distance. A new distance matrix is then constructed, with the grouped species now considered as one unit. If the grouped species were indexed $i$ and $j$, then for all $k \neq i, j$ the distance from $k$ to the group $(i, j)$ is $d_{k(ij)} = \tfrac{1}{2}(d_{ik} + d_{jk})$. In words, the distance from each other species $k$ to the group $(i, j)$ is the average of the distances from species $k$ to each of species $i$ and $j$ in the group. The new distance matrix is again searched for the smallest element, and the appropriate grouping again occurs. This process is repeated until all species are clustered into a tree.
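The clustering procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name, the dictionary-based distance table, and the four-species distances are all hypothetical. It follows the averaging rule exactly as written in the text, a simple mean of the two distances; many software implementations of UPGMA instead weight the average by the number of species in each cluster, and the two rules differ once clusters of unequal size are merged.

```python
from itertools import combinations


def cluster_by_average_distance(names, dist):
    """Build a nested-tuple tree by repeatedly joining the closest pair.

    names: list of species labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    The distance from k to a new group (i, j) is (d_ik + d_jk) / 2,
    the simple-average rule stated in the text.
    """
    d = dict(dist)          # work on a copy so the input is not modified
    units = list(names)     # current units: species or joined groups
    while len(units) > 1:
        # Find the pair of units with the smallest distance.
        i, j = min(combinations(units, 2), key=lambda p: d[frozenset(p)])
        group = (i, j)
        rest = [k for k in units if k not in (i, j)]
        # Distance from every remaining unit to the new group.
        for k in rest:
            d[frozenset((k, group))] = 0.5 * (
                d[frozenset((i, k))] + d[frozenset((j, k))]
            )
        units = rest + [group]
    return units[0]


if __name__ == "__main__":
    # Hypothetical distances, chosen only to illustrate the procedure.
    pairs = {
        ("A", "B"): 2.0, ("A", "C"): 6.0, ("A", "D"): 10.0,
        ("B", "C"): 6.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
    }
    table = {frozenset(p): v for p, v in pairs.items()}
    print(cluster_by_average_distance(["A", "B", "C", "D"], table))
```

With these made-up distances, A and B are joined first, C is joined to that pair next, and D is added last. The sketch returns only the topology; branch lengths would need to be tracked separately.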
Tree-building methods can not only produce a tree topology, but they generally also give estimates of branch lengths of the tree. An example of one method for branch-length estimation is the method of Fitch and Margoliash (1967). Suppose the number of substitutions distinguishing sequences $i$ and $j$ is $d_{ij}$. If the tree relating sequences 1, 2, and 3 has branch lengths $A$, $B$, and $C$ (Figure 8.25), then the branch lengths can be estimated from

$$A = \tfrac{1}{2}(d_{12} + d_{13} - d_{23})$$
$$B = \tfrac{1}{2}(d_{12} + d_{23} - d_{13})$$
$$C = \tfrac{1}{2}(d_{13} + d_{23} - d_{12}) \qquad (8.17)$$

These relations were found by solving the equations $d_{12} = A + B$, $d_{13} = A + C$, and $d_{23} = B + C$. With more than three sequences, the tree is built up by considering three units at a time, beginning with the two most closely related sequences and grouping the remaining sequences. If sequences 1 and 2 are the most similar, then the distance from sequence 1 to the remaining group is the average of the distances from sequence 1 to each member of the group. In this way, only three distances are considered at a time, and Equations 8.17 allow branch lengths to be estimated. This method is known as least squares, and it turns out that Equations 8.17 minimize the sum of squared deviations from the model, much like linear regression.

Figure 8.25 A simple phylogenetic tree relating species 1, 2, and 3. A, B, and C represent branch lengths from the most recent common ancestor.

Another algorithm for tree construction is particularly well suited to the situation in which one does not know whether rates of substitution are constant across clades of the tree. This method is known as neighbor-joining, because it groups species that are "neighbors" (Saitou and Nei 1987). Begin by assuming that the sequences are all related to one another by a star phylogeny (Figure 8.26). For a star phylogeny with $N$ sequences, the sum of the branch lengths is

$$S = \frac{1}{N-1} \sum_{i<j} d_{ij}$$

(It may help to draw a star phylogeny to see that each branch gets counted $N - 1$ times.)
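Both calculations above are simple enough to sketch in code. The Python fragment below is an illustration under stated assumptions: the function names and the example distances are hypothetical. It solves $d_{12} = A + B$, $d_{13} = A + C$, $d_{23} = B + C$ for the three branch lengths, and computes the total length of a star phylogeny as the sum of all pairwise distances divided by $N - 1$.

```python
def three_taxon_branch_lengths(d12, d13, d23):
    """Solve d12 = A + B, d13 = A + C, d23 = B + C for (A, B, C)."""
    a = 0.5 * (d12 + d13 - d23)
    b = 0.5 * (d12 + d23 - d13)
    c = 0.5 * (d13 + d23 - d12)
    return a, b, c


def star_tree_length(dist):
    """Sum of branch lengths of a star phylogeny.

    dist is a square, symmetric matrix (list of lists) of pairwise
    distances. Each branch is counted N - 1 times in the sum over
    all pairs, hence the division by N - 1.
    """
    n = len(dist)
    total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
    return total / (n - 1)


if __name__ == "__main__":
    # Hypothetical distances among three sequences.
    a, b, c = three_taxon_branch_lengths(0.3, 0.5, 0.6)
    print(a, b, c)
    matrix = [
        [0.0, 0.3, 0.5],
        [0.3, 0.0, 0.6],
        [0.5, 0.6, 0.0],
    ]
    print(star_tree_length(matrix))
```

For three sequences the star phylogeny is the only unrooted tree, so its total length equals $A + B + C$ from the three-sequence solution, which gives a quick consistency check on both functions.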
Next we begin a procedure that groups certain sequences together. For each possible pair of sequences, a tree like that in step 1 in Figure 8.26 is constructed. Branch lengths for this tree are estimated by least squares, and the sum of the branch lengths for the entire tree ($S_{ij}$) is calculated. We consider as neighbors that pair of sequences $i$ and $j$ that gives the minimum of the $S_{ij}$'s. After the first pair of neighbors is found, that pair is considered as a single entity (joined neighbors), and the process of considering all possible pairings is repeated. The distance from any one sequence $k$ to this pair of neighbors ($i$ and $j$) is the average of the two distances, or $\tfrac{1}{2}(d_{ik} + d_{jk})$. The process ends when there are just three neighbors left, and at this point we have a finished neighbor-joining tree complete with branch lengths. The criterion for neighbor-joining is to minimize the sum of branch lengths, and sometimes it is possible to find tree topologies that are even shorter, using a method called minimum evolution trees (Rzhetsky and Nei 1992).

Figure 8.26 Illustration of the neighbor-joining method for phylogeny reconstruction, for five sequences A through E. Given a distance matrix, one starts with a star phylogeny and tests all trees having different pairs separated from the rest. The tree with A-B joined is the shortest such tree. The process of testing all pairs of "neighbors", where a neighbor may be either a single allele or a cluster of alleles, is repeated until no more joining can be done. (See Saitou and Nei 1987.)

PROBLEM 8.10 Consider a sample of one allele drawn from each of three species. Suppose that the tree that one gets from these alleles may be represented ((A,B),C), implying that A and B are most closely related, and C is the outgroup. What are the possible relationships among the species bearing these alleles?

ANSWER This problem bears on an important issue in phylogeny reconstruction, namely that any one gene tree does not necessarily reflect the true pattern of splitting of species. The easiest way to see this is to consider ancestral populations as being polymorphic, in which case the speciation process may sort out the alleles in various ways. It turns out that the possible species trees include ((A,B),C), ((A,C),B), and ((B,C),A). In other words, the gene tree does not eliminate the possibility of any of the species trees.

Distance Methods versus Parsimony

There is no universal theory that provides a single optimal way to construct phylogenetic trees, and as basic as the distance matrix seems, it is not required by all methods. Another method, known as maximum parsimony, uses the smallest number of mutational events necessary to account for the evolution of a set of sequences from a common ancestor to construct the trees. There are a number of such parsimony methods based on trees with the smallest number of substitutions, but none guarantees that the most parsimonious tree is the correct tree. For example, when rates of substitution differ in different branches of the tree, the parsimony method often fails to give the correct topology (Felsenstein 1978). Methods for constructing phylogenetic trees have been reviewed by Felsenstein (1981, 1982) and more recently by Nei (1996). Massive simulation studies have been done to test the statistical reliability of tree-constructing methods (Rohlf and Wooten 1988; Sourdis and Nei 1988; Hillis 1996). Results of these simulations are easy to summarize: if the data allow one method to assign a topology with good statistical confidence, generally all the popular methods work pretty well. But if the data have many apparent reverse mutations, variable rates among branches, or wide variation in rates across sites, then none of the methods works very well.
Bootstrapping and Statistical Confidence in a Tree

Because there are so many possible tree topologies, it is important to assess how much statistical confidence one can place in a particular tree. One cannot assign a numerical standard error to a tree; by its geometrical nature a tree is actually a complicated statement of phylogenetic relationships, such that we might have high confidence in some branches, and low confidence in others. A widely used method of assessing confidence in the nodes of a tree is the bootstrap test (Felsenstein 1985). The basic idea is quite simple: a new data set is drawn from the original data by sampling sites with replacement and, from this new data set, a tree is drawn. For each node in the original tree, we ask whether the new tree has the same cluster of sequences. The whole operation of resampling the data, drawing a tree, and tallying up nodes that are in the original tree is repeated perhaps 1000 times, and the final result is displayed graphically as a number next to each node indicating the percentage of time that cluster is present among the resampled trees. If that fraction is high, then one gains confidence that the given cluster actually belongs together.

Another means of testing the statistical confidence in a tree is to test the null hypothesis that each interior branch has length zero. From distance methods, we often obtain estimates of all branch lengths in the tree, along with their standard errors. If we fail to reject the null hypothesis of zero length for an interior branch, then we lose confidence in the nodes surrounding that branch.

Shared Polymorphism

One might intuitively expect that all the alleles of a species should cluster together on a gene tree, implying that the common ancestor of all the alleles is an ancestral allele within the same species. A few gene trees have been found to have the unexpected property that alleles in two or more species appear to be interdigitated on the tree.
This pattern, known as shared polymorphism or trans-species polymorphism, has been observed in major histocompatibility alleles in primates (Lawlor et al. 1988), in self-incompatibility alleles of plants (Ioerger et al. 1991), and in several genes in the melanogaster species subgroup of Drosophila (Hey and Kliman 1993). Figure 8.27 shows the probable means by which shared polymorphism arises, namely, that the ancestral species was polymorphic, and two or more alleles have remained in the descendant species ever since the time of the common ancestor. Recall that the expected fixation time for a new mutation, given that it goes to fixation, is $4N$ generations. This means that neutral alleles are quite unlikely to remain polymorphic for much longer periods. Consequently, observation of shared polymorphism implies that either strong selection is retaining the alleles in the population, or that the species have diverged relatively recently. In the first two examples above, there is good evidence that selection has maintained the polymorphisms, while in the third example, the Drosophila species are recently enough diverged that some shared neutral polymorphisms are expected.

Figure 8.27 Trans-species or shared polymorphism may occur if the ancestor was polymorphic for two or more alleles and if alleles persist to the present in both species.

Interspecific Genetics

Phylogenetic inference from molecular sequences is a descriptive goal in the sense that the primary objective is to obtain an accurate representation of the ancestral history of the species. Population genetics can also address the genetic basis for species differences, particularly in the case of species in which some hybrids are at least partially fertile. Although these studies do not directly address the genetic causes for species origination, they are relevant to the genetic causes of barriers to interspecific gene flow.
Investigation of the genetic basis for hybrid infertility and inviability among species in the Drosophila melanogaster species subgroup (comprising the species melanogaster, simulans, sechellia, and mauritiana) is a very active area. One focus in this work has been an investigation of the genetic basis for Haldane's rule, which states that, in interspecific hybrids in which only one sex is sterile or inviable, the sex likely to be affected is the heterogametic sex (Coyne 1985; Coyne et al. 1991). Rather than one or two genes of large effect, interspecific hybrid sterility appears to be caused by many genes that also have a complex pattern of interaction, so that some particular combinations are sterile and other combinations are fertile (Palopoli and Wu 1994). A powerful tool for studying the genetic basis of hybrid sterility has been to introgress small pieces of the genome from one species into the other. By doing this for many regions distributed all over the genome, one can learn about the relative roles of the X chromosome and autosomes, the relative incidence of male vs. female infertility, and so forth (True et al. 1996). Other features of interspecific differences are amenable to genetic analysis by either introgression methods (applied to differences in cuticular hydrocarbons by Coyne 1996) or by scoring an array of anonymous markers in many backcross individuals (applied to genital arch morphology by Liu et al. 1996).

MULTIGENE FAMILIES

Genes increase in number through duplication. Several successive rounds of duplication result in a family of homologous genes with related functions, a multigene family, the members of which are often arrayed in tandem along the chromosome. Among genes that normally exist in tandemly arrayed multigene families are the rRNA genes and the histone genes. Analysis of the sequences of members of multigene families has led to some interesting surprises.
Figure 8.28 shows a scenario whereby a gene underwent a duplication that ultimately became fixed in the population either through drift or selection. Subsequently, sufficient sequence divergence occurred that the two genes could be distinguished. Later a speciation event produced two different species sharing this pair of genes. Figure 8.29 shows the gene genealogies at two time points in the evolution of this gene family. At time 1, the A genes in species 1 and 2 have a more recent common ancestor than do genes A and B within species 1. At time 2 the pairs of genes present in the same species are more similar. This is the pattern that is observed in some multigene families. The close resemblance of $A_1$ with $B_1$, and of $A_2$ with $B_2$, seems paradoxical, since both species have the duplication, and Figure 8.28 makes it appear that genes $A_1$ and $A_2$ in the two species have a more recent common ancestor than do genes $A_1$ and $B_1$. Genes $A_1$ and $B_1$, as well as $A_2$ and $B_2$, may have more similar sequences because the genes evolve together, in concert, under the influence of mechanisms that operate to homogenize their sequences. This tendency toward homogenization is known as concerted evolution.

Figure 8.28 Multigene families originate by a process of gene duplication. After the duplication the genes may retain very similar functions (like rRNA genes), or they may diverge (like globin genes). If the species splits into two species, then time 1 and time 2 depict the relationship between the genes shortly after speciation and long after speciation (see Figure 8.29).
Causes of Concerted Evolution

Two important mechanisms of concerted evolution are gene conversion and unequal crossing-over. Gene conversion is a process in which nucleotide pairing between two sufficiently similar genes is accompanied by the excision of all or part of the nucleotide sequence of one gene and its replacement by a replica of the nucleotide sequence from the other gene. Formally, the result is that the sequence in one gene "converts" the sequence in the other gene to be exactly like itself. In unequal crossing-over, meiotic pairing between the tandem repeats in homologous chromosomes is out of register, and crossing-over results in an increase in the number of copies in one chromosome and a corresponding decrease in the number of copies in the other chromosome. Repeated rounds of unequal crossing-over can result in the disproportionate representation of certain sequences among members of the multigene family, a result that is formally equivalent to gene conversion.

Figure 8.29 Referring to Figure 8.28, at Time 1, genes $A_1$ and $A_2$ in the two species are more similar to each other than either is to gene B, and likewise $B_1$ and $B_2$ are closest neighbors. This tree reflects the fact that the common ancestor of $A_1$ and $A_2$ is more recent than that of $A_1$ and $B_1$. If at Time 2 a tree like the bottom panel is observed, then sequences of $A_1$ and $B_1$ have become more similar, possibly by the process of gene conversion. The bottom tree illustrates the phenomenon known as concerted evolution.

A theoretical model of concerted evolution has been studied by Ohta (1982).
In this model, a tandemly arranged multigene family consists of a fixed number $n$ of members, and $\lambda$ is the probability that a particular member of the gene family becomes converted by another member in any one generation. (Equivalently, $\lambda$ is the probability of completion of a cycle of unequal crossing-over resulting in the replacement of one sequence in the family by another.) The mutation rate per copy is $u$, and the population number is $N$. In a tandemly arrayed multigene family, there are three distinct types of identity by descent (IBD) among the gene copies (Figure 8.30):

1. Genes at different positions in the same chromosome may be IBD (probability $c_1$).
2. Genes at different positions in different chromosomes may be IBD (probability $c_2$).
3. Genes at the same position in different chromosomes may be IBD (probability $f$).

Complex formulas for the equilibrium values of $c_1$, $c_2$, and $f$ have been derived by Ohta (1982), but they are greatly simplified when recombination within the gene cluster is ignored. In such a case, the equilibrium values are approximately

$$c_1 = c_2 = \frac{\lambda}{\lambda + (n-1)u}, \qquad f = \frac{4N\lambda c_2 + 1}{4N\lambda + 4Nu + 1} \qquad (8.18)$$

In Equations 8.18, the quantity $(n-1)u$ is very nearly equal to $nu$ if $n$ is reasonably large. Because $n$ is the number of copies of the gene in each tandem array, $nu$ is the total rate of mutation in the multigene family, summed across all copies. Thus, the implication of Equations 8.18 is that there is a delicate balance between the rate of gene conversion $\lambda$ and the total mutation rate $nu$. If the rate of gene conversion is much greater than the total mutation rate, then the probability of IBD of genes at different positions within the family ($c_1$ and $c_2$) is close to 1.0. On the other hand, if $\lambda$ is much smaller than the total mutation rate, then the probability of IBD of genes at different positions within the family is close to zero.

Figure 8.30 Three types of identity by descent in multigene families. They are the identity between genes at homologous sites (probability $f$), between genes at nonhomologous sites in the same chromosome (probability $c_1$), and between genes at nonhomologous sites in different chromosomes (probability $c_2$). (After Ohta 1982.)

Concerted evolution does not homogenize all multigene families. Depending on the balance of the forces of mutation, gene conversion, and unequal crossing-over, the pair of genes may remain active and very similar, or they may diverge in function (such as different tissue-specific forms of amylase or lactate dehydrogenase), or one gene may lose function and become a pseudogene. Multigene families can avoid the accumulation of mutations when there is sufficiently strong natural selection, and positive selection is necessary for genes to evolve new functions. Walsh (1988) addressed the question of genes within a family escaping from gene conversion, and he showed that higher mutation rates and lower conversion rates lead to greater likelihood for a gene escaping conversion. Once a gene is sufficiently divergent to have escaped conversion, it can either lose function and become a pseudogene or it can acquire a new function. Simple models of such a duplicated gene show that very little selection is needed in a large population to avoid a pseudogene fate (Walsh 1995).

Multigene Family Evolution through a Birth and Death Process

Duplicate genes can evolve in separate ways under the influence of natural selection, mutation, and random genetic drift. In time, some members of a multigene family may diverge to a greater or lesser degree in their function. This process of duplication and divergence is thought to be the major mechanism by which genes with novel functions are created.
Some multigene families retain a tandemly arrayed structure and similarity in function across members despite the fact that the differences between individual members are of functional significance. This pattern is particularly true of genes in the immune system, including immunoglobulin genes and major histocompatibility genes. Interspecific comparisons of genes in families of this sort exhibit some genes that are clearly homologous, and others that are more distantly related. In addition, the rate of duplication, loss of function through pseudogenes, and loss by deletion may be fairly high. This kind of pattern of multigene family evolution is different from concerted evolution, because the differences between the genes can be high enough that intergenic conversion is very rare. Figure 8.31 illustrates the distinctness of this pattern of gene evolution, called a birth-and-death process by Ota and Nei (1994).

Figure 8.31 Three modes of multigene family evolution, each shown for Species 1 and Species 2: (A) concerted evolution, (B) divergent evolution, (C) evolution by birth and death process. In addition to concerted evolution and simple divergent evolution, multigene families frequently exhibit the phenomenon of genes being added and lost to families by a "birth and death process." (From Ota and Nei 1994.)

Figure 8.32 illustrates the result of duplication and divergence in two related multigene families in mammals that code for the α-like and β-like polypeptide chains of hemoglobin. The genes are specialized for different periods of life. The ε (epsilon) genes are expressed in embryos; the $^G\gamma$ and $^A\gamma$ genes and the α genes in the fetus; and the α, β, and δ genes in the adult. The inference from differences in nucleotide sequence is that the original α-β duplication took place approximately 500 million years ago, when vertebrates were represented by the bony fishes, and the β-γ duplication took place about 80 million years ago, during the mammalian radiation. More recent duplications have also occurred, for example those leading to the two functional α genes, the cluster of three α-like pseudogenes, and the two γ genes.

There are several models for the sequence of duplication, deletion, and conversion events that could have led to the current array of globin genes (Goodman et al. 1984; Hardies et al. 1984; Hardison 1984; Margot et al. 1988), but it appears well substantiated that the ancestral cluster that predated the mammalian radiation was 5'-ε γ η δ β-3'. Within the mammalian radiation, the different orders of mammals evolved along different routes. In prosimian primates, such as lemurs, there was a fusion of η and δ. In higher primates, including humans, there was a δ-β conversion and a γ duplication. In rodents, β and γ both duplicated, η was deleted, and there was a δ-β fusion, mediated probably by an unequal crossover. In rabbits, η was deleted and there was a δ-β conversion. Finally, in goats, γ was deleted, there was a δ-β conversion, and the remaining four-gene array was then triplicated!

The evolutionary history of the fetal globin genes in humans reveals that the $^G\gamma$ and $^A\gamma$ genes originated as part of a relatively recent 5 kb tandem duplication (Shen et al. 1981).
Furthermore, evidence from nucleotide sequences strongly suggests that a gene conversion event also occurred, which converted part of one particular Aγ allele into a Gγ allele (Slightom et al. 1980). The converted Aγ allele is very similar to a Gγ allele for about 1550 bp on the upstream (5') side of a putative recognition signal for gene conversion (a stretch of repeating TG and CG dinucleotides); but on the downstream (3') side of the putative signal, the converted Aγ allele is typical of other Aγ alleles in the human population. The Aγ to Gγ gene conversion occurred much more recently than the duplication resulting in the close sequence similarity of the Aγ and Gγ genes. The estimate of the time of occurrence of the Aγ-Gγ duplication can be improved by using the nucleotide sequence data from the entire duplicated 5 kb region. In the entire region, 14% of the nucleotide sites differ, which translates into k = 0.155 ± 0.006; this suggests a time for the duplication of 0.155 × 100 × 2.2 × 10^6 = 34 million years (Shen et al. 1981).

Figure 8.32 [Diagram: arrangement of the β-like globin genes from the mammalian and eutherian ancestors through marsupials, mouse, goat, and human, with embryonic, fetal, and adult genes marked.] Reconstruction of the β-globin sequences in a series of mammals illustrates the complexity of duplication, loss, and gene conversion in this multigene family. (After Hardison 1984.)

Unequal crossing-over in multigene families can result in a decrease in the number of genes as well as an increase. It is therefore not surprising that deletions of one or more of the hemoglobin genes are found in most parts of the world.
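The arithmetic behind the duplication-time estimate given above (k = 0.155 ± 0.006 and about 34 million years) can be retraced with the Jukes-Cantor correction. The following is a minimal sketch, not the procedure of Shen et al.: it assumes the one-parameter Jukes-Cantor model, takes the number of compared sites to be 5000 (an assumption based on the "5 kb" region), and reads the calibration of 2.2 × 10^6 years per percent divergence directly off the product quoted in the text. The function names are illustrative.

```python
import math

def jukes_cantor_distance(p):
    """Substitutions per site estimated from the observed proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jukes_cantor_standard_error(p, n_sites):
    """Large-sample standard error of the Jukes-Cantor distance."""
    return math.sqrt(p * (1.0 - p) / n_sites) / (1.0 - 4.0 * p / 3.0)

p_observed = 0.14          # 14% of sites differ (from the text)
n_sites = 5000             # assumed: the whole 5 kb duplicated region
years_per_percent = 2.2e6  # calibration implied by the product quoted in the text

k = jukes_cantor_distance(p_observed)
se = jukes_cantor_standard_error(p_observed, n_sites)
time_years = k * 100 * years_per_percent

print(f"k = {k:.3f} +/- {se:.3f}")
print(f"estimated time = {time_years / 1e6:.0f} million years")
```

Under these assumptions the sketch gives values that round to the figures in the text; a different number of aligned sites would change the standard error but not k.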
Although usually very rare, in a few places the frequency of the deletions reaches levels too great to be accounted for by chance, especially in view of the observation that the carriers are mildly to severely anemic. Although a deletion of the β-gene results in death when homozygous, a β deletion and other mutations that decrease the abundance of the β-hemoglobin chain are relatively common in the Mediterranean Sea basin where malaria is endemic. For this reason, the decreased β-chain diseases are called β-thalassemias (literally translated as "sea-anemias"). The well-established link between sickle-cell anemia and malaria, along with the geographical correlation between the β-thalassemias and malaria, provides a strong circumstantial case for malarial parasites being an important selective agent. Deletion of one or more of the α-globin genes results in another form of anemia called α-thalassemia, whose frequency in populations is also correlated with the incidence of malaria.

Red-green colorblindness is a common X-linked disorder with a frequency of about 8% in Caucasian males. The genes for the red and green visual pigments match at 98% of their nucleotides, indicating that they arose by a relatively recent duplication. Individuals with normal color vision have one copy of the red pigment gene and varying numbers of copies of the green pigment gene. When genomic DNA from colorblind males was analyzed by Southern blotting, those defective in green vision were lacking fragments of the green pigment gene. Further analysis showed that 24 of 25 colorblind individuals had lost one or the other pigment gene through gene rearrangements that were due either to unequal crossing-over or gene conversion. In this example, the high sequence similarity of the red and green pigments works to human disadvantage by greatly increasing the likelihood of exchange events that lead to loss of color vision (Nathans et al. 1986).
The relationship between the molecular basis of light absorption and perception was made particularly clear when it was found that a normal polymorphism in red pigments, which confers a difference in the absorption peak of the protein product, also confers a measurable difference in the perception of color balance (Merbs and Nathans 1992).

Duplication of genes also occurs in plants, including a particularly important gene that encodes the carbon-fixing enzyme ribulose-1,5-bisphosphate carboxylase (RBC) (Clegg et al. 1997). The functional RBC holoenzyme consists of eight large and eight small subunits. Early in plant evolution, both the large and small subunits of RBC were encoded by the chloroplast genome, but the small subunit gene was transferred to the nuclear genome at an early stage and has now been lost from the chloroplast genome. Diploid angiosperms contain from two to eight copies of the gene for the small RBC subunit (rbcS). All copies of rbcS appear to be functionally equivalent, and sequence analysis shows that the genes that are closest together in the genome are also generally more similar in sequence. In sequence comparisons among rbcS genes of tobacco and tomato, homologous genes compared between the two species are more similar than within-species comparisons of gene copies. This finding is not the pattern expected under concerted evolution. The variable number of loci across angiosperms suggests that gain and loss of gene copies occur to give a pattern like the birth-and-death process described above.

Structural RNA Genes and Compensatory Substitutions

Transfer RNA and ribosomal RNA molecules derive their biochemical properties from the secondary structure into which they fold. We are still learning the chemical rules by which such macromolecules attain their final folded configuration, but one thing that is very clear is that complementary base pairing is important.
The stems of tRNAs are critical to maintaining the tightly folded structure of these essential molecules. Substitutions that occur in stems will weaken the stability of the stem unless there is a compensatory change on the other strand that maintains base pairing. Kimura (1985) realized that one could obtain evidence for such compensatory changes. More recently, such compensatory changes have been demonstrated in an intron, indicating that the folding structure of introns may also be important in regulating gene expression (Kirby et al. 1995).

Further evidence of the importance of secondary structure of rRNA comes from analysis of rRNA pseudogenes in plants (Buckler et al. 1997). One attribute of secondary structure is measured as the difference in free energy attributable to complementary base pairing in the folded vs. unfolded state. Computer predictions of the best folding structure of the rRNA pseudogenes suggested that the difference in free energy decreases as the sequences accumulate substitutions. Tests of randomly permuted sequences showed that the functional rRNA sequences are significantly more stable than would be obtained by chance, whereas predicted pseudogene RNAs are not. Some introns have a significantly open secondary structure, such that random substitutions in their sequences result in more stable structures (Leicht et al. 1995). The reason some introns retain an open structure may be for access to regulatory proteins. This possibility has been indirectly demonstrated by showing that stable stems inserted into introns in yeast can disrupt normal splicing.

The ribosomal RNA gene cluster in Drosophila melanogaster consists of about 200 copies of a repeated unit on both the X and the Y chromosome, with each repeated unit containing an 18S and a 28S rRNA gene separated by an intergenic sequence (IGS) (Glover and Hogness 1977).
The rRNA genes provide a clear example of concerted evolution because of great interspecific differences in spite of a high degree of sequence conservation within species (Coen et al. 1982). Furthermore, within individuals of D. mercatorum, there appears to be little sequence variation, yet there are clear differences between individuals due to length variation in the intergenic sequence (Williams et al. 1985). This finding suggests the operation of a strong homogenizing force maintaining sequence fidelity within individuals.

In humans, the rDNA repeat consists of a 13 kb transcribed portion and a 31 kb spacer (Wellauer and Dawid 1979). This repeated unit is present in about 300 copies located near the tips of the short arms of five nonhomologous chromosomes. Despite the dispersed locations, concerted evolution still occurs, as evidenced by much less variation among sequences within an individual than among species. Interchromosomal exchange events would lead to conservation of sequence distal to the rDNA cluster on each chromosome, and evidence for this conservation has been found (Worton et al. 1988).

Multigene Superfamilies

In some cases, several sets of multigene families and single-copy genes may share recognizable homology, implying a common ancestry, but they have undergone major divergence in function and relocation of position within the genome. These sets of historically related but functionally distinct genes constitute a multigene superfamily. The remarkable similarities found among portions of genes in related gene families have suggested that many proteins have functional modules that can be combined in various ways in what is called exon shuffling. One example of shuffling is found in tissue plasminogen activator (TPA), which has portions of three other proteins, including plasminogen, epidermal growth factor, and fibronectin.
The striking finding is that the junctions of these protein segments fall precisely at intron-exon junctions. The epidermal growth factor shares exon similarity with several other proteins, including blood clotting factors IX and X, urokinase, and complement C9 (Doolittle 1985). The gene for the low-density lipoprotein (LDL) receptor in human beings extends over 45 kilobases and contains 18 exons that show similarity to a bewildering variety of other proteins, including epidermal growth factor and blood clotting factors (Sudhof et al. 1985). Just as a computer programmer recognizes the value of reusing subroutine modules in different programs, nature has capitalized on the efficiency of modular gene organization.

One extensively studied multigene superfamily that serves diverse functions in immunity is illustrated in Figure 8.33 (Hood 1985; Hunkapiller and Hood 1986). The primordial single-copy gene may have coded for a cell-surface receptor containing the basic homology unit of the superfamily, which is about 110 amino acids in length with a strategically placed disulfide bridge and folding characteristics enabling it to combine with other similar units. An early duplication and divergence of the primordial gene resulted in the variable (V) and constant (C) domains that have been so versatile in their diversification for specialized immune functions. In some members of the immunoglobulin superfamily, shown at the left in Figure 8.33, the functional products are usually individual polypeptide chains, sometimes containing internal duplications of the primordial folding unit. These products include the poly-Ig receptor that mediates the transport of immunoglobulin molecules across cell membranes.

Figure 8.33 [Tree diagram: a primordial cell-surface receptor gene giving rise to single-gene members (left) and multigene families (right).] Proposed evolution of the immunoglobulin multigene superfamily from a primordial gene coding for a cell-surface receptor. Details of the evolutionary relationships are speculative. The superfamily has diversified into 12 single-gene representatives (all of those at the left, plus β2-microglobulin, β2-m, at the right), and eight multigene families (remaining representatives at the right). These include genes for antibodies, T-cell receptors, major histocompatibility antigens, and other functions. The single-gene members include T-cell molecules implicated in MHC recognition (CD4 and CD8) and possibly ion channel formation (T3δ, T3ε), an immunoglobulin-transport protein (poly-Ig), a plasma protein (α1B-glycoprotein), two molecules restricted to lymphocytes and neurons (Thy-1 and OX-2), two brain-specific proteins (N-CAM and NCP3), and β2-microglobulin. The multigene families include the heavy (H) and light (κ, λ) components of antibody molecules, the α, β, and γ chains of T-cell receptors, and the Class I and Class II molecules from the major histocompatibility complex (HLA). (Adapted from Hood et al. 1985 and Hunkapiller and Hood 1986.)

In the other main branch of the superfamily, shown at the right, the functional products are usually aggregates of polypeptide chains. In this branch, there occurred multiple duplications of the V regions and specialization of D (diversity) and J (joining) regions during the evolution of the DNA splicing mechanism in lymphocytes, which today results in the tremendous diversity of antibodies and T-cell receptors. During the formation of heavy-chain antibody genes in the lymphocytes, any one of a large number of DNA sequences coding for the variable part of the molecule can become spliced with any one of a small number of DNA sequences coding for the constant part, with diversity and joining regions incorporated in between.
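As a rough illustration of how quickly such splicing choices multiply, the short sketch below counts the joint combinations when one segment is drawn independently from each class. The segment counts are hypothetical placeholders chosen only to show the arithmetic; they are not measured values from the text, and the function name is illustrative.

```python
from math import prod

def count_combinations(segment_counts):
    """Number of distinct joined products when one segment is chosen from each class."""
    return prod(segment_counts.values())

# Hypothetical numbers of alternative segments in each class (illustration only)
segment_counts = {"V": 200, "D": 12, "J": 4, "C": 8}

combinations = count_combinations(segment_counts)
print(combinations)
```

The count is a simple product, so adding even a few alternatives in any one class scales the total proportionally; any variation in the exact joining positions would multiply it further.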
The many possible V-D-J-C combinations enable enormous numbers of different possible antibodies to be formed, a number that is increased still further by slight variation in the exact positions of the splice junctions. An analogous type of splicing process occurs in the formation of antibody light-chain genes and T-cell receptor genes.

In yet another offshoot of the immunoglobulin superfamily, shown at far right in Figure 8.33, the C region underwent duplication and specialization to form molecules of the major histocompatibility complex (MHC), which, among other functions, are necessary for the T cells of the immune system to recognize foreign antigens. Complete sequencing of a 100 kb region of the T-cell receptor gene family has revealed a spectacular degree of sequence conservation between human and mouse (Koop and Hood 1994). The opportunities for exceptionally detailed analysis of multigene family evolution have enlarged with genomic sequencing methods already producing the complete sequence of entire arrays of genes (Rowen et al. 1996). Although many aspects of the immunoglobulin superfamily tree in Figure 8.33 are speculative, the molecules are undoubtedly related because comparison of the relevant units gives 15 to 40% homology at the amino acid level, and at the DNA level each homology unit is encoded in a separate exon. The immunoglobulins thus demonstrate the immense evolutionary potential of repeated rounds of duplication and divergence through specialization of function.

Dispersed Highly Repetitive DNA Sequences

A second major class of highly repetitive DNA in eukaryotes is not localized in clusters of tandemly repeating units, but is dispersed throughout the genome with single-copy sequences. The importance of dispersed repetitive elements to the human genome project is made clear by the realization that they constitute 35% of our genome (Smit 1996).
In vertebrates, this dispersed highly repetitive DNA occurs primarily in two categories, denoted SINEs and LINEs (Singer 1982). SINEs (short interspersed elements) are sequences typically shorter than 500 base pairs which occur in 10^5 or more copies in the genome. Like tRNA genes, they contain internal transcriptional start sites and are transcribed by RNA polymerase III. LINEs (long interspersed elements) are sequences typically greater than 5000 base pairs that occur in 10^4 or more copies in the genome. They are processed pseudogenes (see below) and, when transcribed, are transcribed by RNA polymerase II. Marked differences in the particular array of subfamilies of SINEs and LINEs or both are frequently observed among even closely related species (Figure 8.34). The mechanisms and possible significance of such massive and rapid changes in repetitive DNA in the genome are very obscure.

One example of SINEs in human DNA is the Alu family, named because the sequence contains a characteristic restriction site for the restriction enzyme AluI. The Alu sequence is about 300 nucleotides in length. Alu sequences are present in approximately one million copies in the human genome and constitute approximately ten percent of the total DNA (Smit 1996). Sequences closely related to Alu are found in other primates, and more distantly related sequences occur in rodents and probably in all placental mammals. Two randomly chosen human Alu sequences differ, on the average, at 15 to 20% of their nucleotide sites, which calculates to a time of divergence of between 16.7 and 23.3 million years. In the human genome there is an Alu element an average of every 3 to 5 kb, but the distribution is not uniform. For example, the β-tubulin and thymidine kinase gene regions have about 10 times the average density of Alu repeats (Slagel et al.
1987), and Alu repeats show a preference for integrating into oligo-dA runs (Daniels and Deininger 1985).

Figure 8.34 [Dot plot; axes in kilobase pairs, human versus rabbit.] A dot plot comparison of the human and rabbit sequences spanning δ- and β-globins. Each dot represents a small bit of sequence similarity, much of the background due solely to chance, and the regions of extended similarity stand out as diagonal line segments. The scales are in kilobases, and the rectangles indicate the location and organization of the globin genes. The solid arrows show the location of a rabbit L1 repeat, and open triangles indicate human Alu sequences and rabbit OcC repeats (a rabbit SINE). The major diagonal line indicates that there is noticeable homology retained through the δ-β intergenic region, and the sequence similarity of human β-globin with rabbit δ-globin (and vice versa) is evident. (From Margot et al. 1988.)

PROBLEM 8.11 The third chromosome of Drosophila pseudoobscura is polymorphic for more than a dozen inversions that result in different gene orders. Polymorphisms of this sort are different from nucleotide site substitutions because they retain some information about the order of events. Consider, for example, the sequences A-B-C-D-E and C-E-A-D-B. Can you deduce the order of the events that connect them?

ANSWER From A-B-C-D-E, the first inversion must have been the segment A-B-C, giving the sequence C-B-A-D-E. Next, the segment A-D inverted to give C-B-D-A-E. Finally, the segment B-D-A-E inverted to give C-E-A-D-B. Much more elaborate problems of inference have arisen to determine the ancestral series of inversions and number of events needed to go from one gene order to another.
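The chain of inversions in the answer can be checked mechanically. The sketch below is an illustration only: it treats genes as unoriented markers (an assumption; real inversions also flip gene orientation), replays the three steps, with the second step reversing the two adjacent genes A-D, and then uses a brute-force breadth-first search to confirm that no shorter series exists. The function names are hypothetical, and the exhaustive search is practical only for very short gene orders.

```python
from collections import deque

def invert(order, i, j):
    """Return a copy of `order` with the segment order[i:j] reversed."""
    return order[:i] + order[i:j][::-1] + order[j:]

def min_inversions(start, target):
    """Fewest inversions turning `start` into `target`, by breadth-first search."""
    start, target = tuple(start), tuple(target)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        order, steps = queue.popleft()
        if order == target:
            return steps
        n = len(order)
        # Try every segment of length two or more
        for i in range(n - 1):
            for j in range(i + 2, n + 1):
                nxt = invert(order, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + 1))
    return None  # unreachable only if the two orders contain different genes

ancestral = tuple("ABCDE")
step1 = invert(ancestral, 0, 3)  # invert A-B-C      -> C-B-A-D-E
step2 = invert(step1, 2, 4)      # invert A-D        -> C-B-D-A-E
step3 = invert(step2, 1, 5)      # invert B-D-A-E    -> C-E-A-D-B

print("-".join(step3))
print(min_inversions(ancestral, tuple("CEADB")))
```

Because every one of the six adjacencies (counting the two ends) is disrupted between the two orders, and a single inversion can repair at most two, three inversions is the minimum; the search agrees.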
Computer scientists refer to this problem as "sorting by reversals." You can see that given any random ordering of integers, a finite number of inversions or reversals will put them into the correct order. Motivated by the biological problem, an algorithm for finding the minimum number of reversals to go from one order to another was recently implemented (Bafna and Pevzner 1996). As more genomes are fully mapped and sequenced, this is likely to be an area of considerable excitement. Ehrlich et al. (1997) recently estimated the number of rearrangements required to connect the human and mouse genetic maps at about 180.

An example of LINEs in the human genome is the L1 family of sequences (also called LINE-1 or Kpn, because of a characteristic restriction site). The L1 sequences average about 2,000 nucleotides, and the 50,000 copies of the sequence in the human genome account for about 4% of the total DNA. As with the Alu family, sequences related to L1 are found in other mammals, including the mouse (Hardies et al. 1986) and the rabbit (Demers et al. 1986). Not all insertions of L1 sequences are innocuous. Kazazian et al. (1988) found two cases of hemophilia A that were caused by de novo insertions of an L1 sequence into exon 14 of the factor VIII gene, whose function is necessary for normal blood coagulation. This insertional mutation event was evidently mediated by an RNA intermediate and provides a mechanism for natural selection to operate on L1 elements. Another deleterious mutation caused by a transposable element in humans was an insertion of an L1 sequence into the myc oncogene in a human breast cancer (Morse et al. 1988).

In their molecular organization, LINE sequences strongly resemble a class of pseudogenes known as processed pseudogenes. Processed pseudogenes are thought to result from the reverse transcription of an RNA molecule into DNA, followed by insertion of the DNA into the genome.
The reverse transcription and integration process can be carried out by an enzyme called reverse transcriptase, which is coded in the genome of a class of RNA-containing viruses called retroviruses. In cells infected with retrovirus, the reverse transcriptase makes a DNA copy of the viral RNA, and another enzyme inserts the DNA into the chromosome. When reverse transcription and integration happen to a processed RNA molecule, the result is a dispersed duplicate copy that is generally transcriptionally inactive due to loss of regulatory sequences. Such a sequence is known as a processed pseudogene. Many genes are known to have processed pseudogene counterparts, including the genes for human κ-immunoglobulin and β-tubulin, rat α-tubulin and cytochrome c, and mouse α-globin. Not all genes that have been processed through an RNA intermediate are pseudogenes. Human phosphoglycerate kinase (PGK) occurs as an active X-linked gene, a processed X-linked pseudogene, and an autosomal gene with remarkable properties. The normal PGK-1 gene contains 11 exons and 10 introns, but the autosomal gene has no introns and has remnants of a poly-A tail, strongly implying that it was reverse transcribed from an RNA transcript. The intron-free autosomal gene (PGK-2) is expressed in human testes (McCarrey and Thomas 1987).

The processed pseudogene model of dispersed repeated DNA evolution is illustrated in Figure 8.35 (Hardies et al. 1986). The functional, transcribed copies of the gene family are shown at the top, and the horizontal arrows represent gene conversion, which promotes concerted evolution of the functional genes. The gene in the center is a preferred donor for gene conversion (biased gene conversion). Emanating from the functional genes are numerous copies of processed pseudogenes distributed throughout the genome. These copies are essentially functionless and undergo sequence divergence promoted by mutation and random genetic drift, which is offset in part by gene conversion and other homogenizing processes among the pseudogenes. Eventually the pseudogene sequences are cleared from the genome by deletion or extreme sequence rearrangement or divergence.

Figure 8.35 [Diagram: functional genes (top) giving rise to processed pseudogenes, which are subject to mutation, random genetic drift, and deletion.] Model for the evolution of a dispersed highly repetitive family of processed pseudogenes. A small number of functional genes (top), which undergo concerted evolution by means of gene conversion, are transcribed under conditions that favor reverse transcription and integration into numerous dispersed chromosomal locations. The resulting nonfunctional genes undergo mutation and random genetic drift, and are ultimately eliminated by deletion or other mechanisms. (From Hardies et al. 1986.)

One implication of the model in Figure 8.35 is that, eventually, a balance is reached in which the clearance of old pseudogenes from the genome is equaled by the creation and insertion of new ones. In the equilibrium state there is a steady turnover among sequences in the family, but the total number neither grows nor shrinks. Studies of a dispersed repeated sequence in the mouse related to human L1 suggest a turnover with a half-life of approximately two million years. That is, after two million years, half the members of the gene family will have been removed and replaced with new ones. However, the L1 family may evolve more rapidly than is typical.

The very abundance of pseudogenes implies that many unrelated genes may have pseudogenes in the same vicinity, as is the case with Alu sequences interspersed in the β-globin cluster. Some fraction of these linked pseudogenes may alter the level, timing, or tissue distribution of transcription of the genes to which they are linked, or they may have subtle effects on chromatin structure that affect gene expression.
Through any of a diversity of mechanisms, pseudogene copies of dispersed highly repeated gene families could, in principle, have effects on phenotype and thus be subject to the influence of natural selection. While true in principle, such effects have not yet been demonstrated. To the extent that such effects can safely be ignored, the evolutionary mechanism of highly dispersed repeated DNA sequences is that of selfish DNA, subject to the conflicting forces of neutral mutation/random drift and the diverse homogenizing processes of concerted evolution.

SUMMARY

The discipline of molecular population genetics has as its theoretical foundation the neutral theory, which provides a rich set of testable hypotheses about the mechanisms that modify patterns of sequence divergence and sequence polymorphism. We saw that underlying models must be specified even to do seemingly straightforward things like estimating rates of substitution. The reason substitution rate estimates are not trivial is that, with greater divergence, subsequent mutations may not further increase the divergence if the site has already been substituted. From observed counts of amino acid or nucleotide differences, we usually want to estimate numbers of changes per site. The model for amino acid substitution is not very difficult because there are 20 amino acids, but even the simplest nucleotide substitution model of Jukes and Cantor is subtle. More complicated models account for differences in rates of transition and transversion substitutions, and it immediately becomes apparent that both the process of mutation and of substitution can be of any imagined degree of complexity. Out of sequence analyses there emerges the pleasing generalization that many sequences appear to diverge at an approximately clock-like rate.
This molecular-clock concept should be interpreted somewhat loosely, because rigorous statistical tests have identified significant irregularities in its rate. In addition, there are dramatic differences in rate of evolution across genes, because the neutral substitution rate differs from one gene to the next. Some lineages appear to have accelerated or decelerated clock rates, and one cause for the variation is a change in generation time (for example, from rodents to primates).

Synonymous and nonsynonymous substitutions have different effects on the protein product, so estimating the rates of these two kinds of substitution independently can be informative about the causes of evolutionary change. For example, most genes, like Drosophila Adh, have a large excess of synonymous changes, an observation that is accounted for by the presumed deleterious effect of most amino acid replacements. Not all synonymous codons are used with equal frequency, and the bias in codon usage implies that even synonymous substitutions may not be selectively neutral. The most sensitive tests for selection make use of comparisons between intraspecific polymorphism and interspecific divergence. Under strict neutrality, these two quantities should be related to one another, and departures in either direction can be detected through heterogeneity among genes.

The neutral theory also makes predictions about the shape of gene trees, and there has been a great deal of excitement about the possibility of testing hypotheses about evolutionary forces based on inferred gene genealogies. (Problem 8.8 gives one example.) Gene trees have been used to test hypotheses about selection, recombination, homogeneity of mutation, and even migration. The ability to account for the patterns of correlation built up by the ancestral history of genes has been a major advance in statistical population genetics.
Organelle genome evolution occupies an important position in the development of molecular population genetics, in part because of the numerous studies of mtDNA and cpDNA variation. Of particular interest and controversy was the work on human mtDNA variation, which raised many intriguing problems about human origins. This work stimulated a huge amount of theoretical study concerning the statistical inferences that could be made from sample data, including times of common ancestry, inference of past demographic histories, and so forth. Several recent studies have shown that mtDNA exhibits patterns consistent with the past operation of natural selection, in violation of many of these models.

Molecular phylogenetics seeks to reconstruct the ancestral history of extant organisms, and shares many analytical procedures with molecular population genetics. There are several widely used algorithms for reconstructing a tree from sequence data, and we examined in some detail the UPGMA, least-squares, neighbor-joining, and parsimony methods. One of the more intriguing patterns of variation to emerge from such interspecific comparisons is that of shared polymorphism, in which two or more species share a number of alleles in common. It is unlikely that shared polymorphism would be maintained for long by chance, so it is not surprising that cases of shared polymorphism are generally found in genes known to be under strong selection or in species that have recently diverged.

When multiple copies of similar genes exist in the genome, they can exchange sequences through unequal recombination and gene conversion. Such exchanges can result in concerted evolution, a process whereby genes in a multigene family are very similar to one another within a species, even though the duplication events that gave rise to the family occurred far in the past. Not all multigene families undergo concerted evolution.
A more common finding is that many multigene families exist as groups of genes with related function that have diverged enough in sequence to escape gene conversion. In this case, new genes appear by duplication and old ones disappear by deletion, sometimes preceded by inactivating mutations that generate pseudogenes. This birth-and-death process gives rise to complex patterns of relationships among genes within gene families.

PROBLEMS

1. Suppose that you have sequences of gene A and gene B from each of two species. The fraction of sites that differ in gene A is 0.7 and the fraction of sites that differ in gene B is 0.5. Apply the Jukes-Cantor formula to obtain the estimate of the number of substitutions per site for each gene. Which gene do you think would have a smaller estimate of variance of substitution rate? Why?

2. Suppose you discover a community of deep sea creatures that have very unusual DNA that has not four bases but six. Adenine and thymine pair, and guanine and cytosine pair just like most DNA, but there are also nitidine and liondine, which also pair. You obtain sequences from two of these creatures and determine that 20% of the sites mismatch in aligned sequences. From this figure, estimate the number of substitutions per site that have occurred since the common ancestor of the two species. (Hint: You know that the number is higher than 0.20, because back mutations could have occurred. Derive an expression like the Jukes-Cantor formula.)

3. The following is a small portion of the gene coding for 6-phosphogluconate dehydrogenase in two natural isolates of E. coli.

1 CTC ACC AAA ATC CCC GCC GTA GCT CAA GAC GGT GAA C<A !'.,( C.T1 ACC W AH GGT GCC
2 CTC AAG CAG ATC CCG CCC CTT CO CAA GAC GCT GAG CM. Tel GIG ACT TAT AIA GGT GCC

Infer the correct translational reading frame of the sequences and estimate:
a. the number of amino acid differences/site.
b. the number of nucleotide differences/site.
c.
the number of non synonymous substitutions per nonsynonymous site (regarding codon sites 1 and 2 as nonsynomyous) d. the number of synonymous substitutions per synonymous site (regarding codon site 3 as synonymous). 4. In the human immunodeficiency virus HIV, which causes acquired immune deficiency syndrome (AIDS), the rate of nucleotide evolution has been estimated at about 0.01 substitutions per synonymous site per year. Two viruses isolated in 1983 in Zaire and San Francisco differ in approximately one third of their synonymous sites. Estimate the year in which the viruses last shared a common ancestor (Data from Li et al., 1988.) 5 The data below give the proportion of nucleotide sites that differ in a gene in four RNA viruses (Yokoyama et al. 1988). HIV1 and HIV2 are two rather distinct types of human immunodeficiency viruses, VISNA is a lentivirus, and MMLV is a mouse cancer-causing virus. Estimate the number of nucleotide substitutions per site using these data. What do the numbers imply about the evolutionary relationships among the viruses? HIV2 VISNA MMLV HIV1 0.34 054 62 HIV2 052 63 VISNA 63 6. What inference would you make regarding the selective constraints on a region of DNA in which the rate of evolution was 5 x 10~ 1 ' nucleotide sub- stitutions per site per year? 7. What might you infer about the evolutionary forces affecting a coding region in which the rate of amino acid replacement was greater than the rate of synonymous nucleotide substitution? 8. Ribsomal RNA forms a complex secondary structure in which many regions of the molecules are folded back and undergo base pairing with complementary nucleotide sequences elsewhere in the same molecule. What pattern of nucleotide sequence evolution might be expected in these paired regions? 9. What is the largest value of rf that makes sense in Equation 8.15 and what does it mean? 394 Chapter 8 10. 
If the rate of nucleotide evolution along a lineage is 0.5% per million years, what is the rate of substitution per nucleotide per year? What is the total rate of divergence of two lineages? 11. While analyzing the DNA sequences of two copies of a gene, you find that there are a total of 34 synonymous substitutions and 16 nonsynony- mous substitutions. Using the method of Nei and Gojobori, you find that there were 310 synonymous nucleotide sites and 633 nonsynonymous sites. If possible, estimate the rates of synonymous and nonsynonymous substitution, and interpret the result. 12. If the effective size of a diploid population is N with respect to autosomal genes, what is it with respect to a. X-Jinked genes? b. Y-Hnked genes? c. mtDNA? 13. Analysis of mtDNA in humpback whales (Baker et al., 1990, Nature 344:238-240) has shown that not only do the Atlantic and Pacific popula- tions show differences, but there are clear geographic subpopulations within oceans despite the lack of geographic barriers. Such a pattern may be observed if either: (1) there were a low rate of migration and a low rate of mtDNA sequence divergence, or (2) a higher rate of migration with a higher rate of mtDNA sequence divergence. Can you distinguish these two possibilities? Can you separately estimate the rate of neutral muta- tion and the rate of migration in a subdivided population? 14. Suppose the phylogeny of five species is ((A,B)R(C(D,E))|, where R des- ignates the root. Can you ascribe the substitution events of the following data uniquely to branches on this genealogy? Number the sites 1-10 and label the substitutions by site number on the tree. Species A TAG CTC ATC A Species B TAG CCG AGC A Species C TAC CCG ATT G Species D TAC CCT ATC A Species E TGC CCT ATC A 15. For an ideal population of effective size N, the average time to loss of a new mutation destined to be lost is 21n(2JV), and the average time to fixa- tion of a new mutation destined to be fixed is 4N. 
For what values of N does a. Fixation time = 10 x loss time? b. Fixation time = 100 x loss time? 16. For the model of gene conversion with gene identities given in Equation 8.18, what value of A, makes the organization of the gene family irrelevant m the sense that/= c, = c 2 ? What is the common value in this case? (k is Molecular Population Genetics 395 the probability that a particular member of the gene family becomes con- verted in any one generation.) 17. For the model of gene conversion with gene identities given in Equation 8.18, what are the values of /and C\ = c 2 when X = p? (The equations assume 4Nu « 1.) 18. In a repetitive gene family being eliminated from the genome by dele- tion, if the fraction of sequences present at time that are still present at time I equals exp(-ht), show that the half life of the sequences equals -ln(V 2 )A. 19. For a repetitive gene family eliminated as described in Problem 18, show that the average persistence ol an element is l/7(. CHAPTER 9 Quantitative Genetics Artificial Selection Heritability Components of Genetic Variance Genotype x Environment Interaction Threshold Traits Genetic Correlation Evolutionary Quantitative Genetics - QTL Mapping any IMPORTANT problems in evolutionary biology begin with observations of phenotypic variation. Darwin formulated his ideas about evolution by natural selection based on observations of phenotypic variation. He struggled for many years to explain the cause of the phenotypic variability, but he was unsuccessful at one level because he did not know about Mendelian genetics. Darwin did, however, appreciate the importance of the observation that offspring resemble their parents. Con- tinuously varying traits, like body size, are influenced by both genetic and environmental factors. Crossing experiments demonstrate that the genetic components of these traits are not determined by single genes because the offspring do not fait into discrete classes with simple Mendelian ratios. 
Instead, what is observed is a general resemblance between parents and offspring, suggesting that there is an underlying genetic basis to the trait, but that the genetic transmission is complex.

A wealth of statistical tools have been developed for analyzing such polygenic traits that do not show simple Mendelian transmission. These approaches allow not only a description of the genetic basis of observed phenotypic distributions, but they also provide a means of predicting the distributions of phenotypes among offspring from observation of the parental phenotypes. Most polygenic traits are influenced by the environment to varying degrees, and they are often called multifactorial traits to emphasize their determination by multiple genetic and environmental factors. For example, variation in human weight is partly due to genetic differences among individuals and partly due to environmental factors such as exercise and level of nutrition. The study of polygenic inheritance goes beyond an oversimplified nature-versus-nurture dichotomy because it is concerned with specifying, in precise quantitative terms, the relative importance of nature, nurture, and their interactions, in accounting for variation in phenotype among individuals. Another compelling reason to study polygenic inheritance is that natural selection occurs at the level of the composite phenotype, and so fitness is a multifactorial trait.

Since natural selection operates on phenotypes, there arises an immediate problem in understanding how phenotypic evolution is reflected in changes that occur at the molecular level. One of the great challenges facing population genetics is to unify the principles of molecular evolution with those governing evolution at the phenotypic level.

TYPES OF QUANTITATIVE TRAITS

Multifactorial traits may be considered as resulting from the combined effects of many quantities, some genetic in origin and some environmental, and for this reason they are often called quantitative traits. The study of quantitative traits constitutes quantitative genetics. Three types of quantitative traits may be distinguished:

1. Traits for which there is a continuum of possible phenotypes are continuous traits; examples include height, weight, milk yield, and growth rate. The distinguishing feature of continuous traits is that the phenotype can take on any one of a continuous range of values. In theory, there are infinitely many possible phenotypes, among which discrimination is limited only by the precision of the instrument used for measurement. However, in practice, similar phenotypes are often grouped together for purposes of analysis.
2. Traits for which the phenotype is expressed in discrete, integral classes are meristic traits; examples include number of offspring or litter size, number of ears on a stalk of corn, number of petals on a flower, and number of bristles on a fruit fly. The distinguishing feature of meristic traits is that the phenotype of an individual is given by an integer that equals the number of elements of the trait that the individual displays. For example, a popular meristic trait used in experimental studies of quantitative genetics in Drosophila is the number of bristles that occur on the abdominal segments or sternites. Normally there are 14 to 24 bristles per sternite. A male with 19 bristles on the fifth abdominal sternite therefore has a phenotype of 19. The distribution of numbers of abdominal bristles in a sample of Drosophila appears in Figure 9.1.
When the number of possible phenotypes of a meristic trait is large (as it is with abdominal bristle number) then the line between continuous traits and meristic traits becomes indistinct.
3. The third category of quantitative traits consists of discrete traits, which are either present or absent in any one individual. In these cases, the multiple genetic and environmental factors combine to determine an underlying risk or liability toward the trait. Liability values are not directly observable. However, an individual that actually expresses the trait is assumed to have a liability value greater than some threshold or triggering level. Traits of this type are called threshold traits, and examples in human genetics include diabetes and schizophrenia. With threshold traits, studies of affected individuals and their relatives permit inferences to be made about the underlying values of liability. These methods are discussed later in this chapter.

Figure 9.1 Number of bristles on the fifth abdominal sternite in males of a strain of Drosophila melanogaster. The smooth curve is that of a normal distribution with mean 18.7 and standard deviation 2.1. (Data from T. Mackay.) [Horizontal axis: number of bristles.]

Quantitative traits are of utmost importance to plant and animal breeders, because agriculturally important characteristics such as yield of grain, egg production, milk production, efficiency of food utilization by domesticated animals, and meat quality are all quantitative traits. Even as modern methods of genetic engineering are applied to animal and plant improvement, quantitative genetics continues to play an important role because commercially desirable traits result from complex interactions among many genes.
In addition to being essential ingredients in plant and animal improvement programs, the principles of quantitative genetics, appropriately modified and interpreted, can be applied to the analysis of quantitative traits in humans and natural populations of plants and animals.

RESEMBLANCE BETWEEN RELATIVES AND THE CONCEPT OF HERITABILITY

For Darwinian evolution to be possible, a necessary feature of the transmission of traits is that offspring must tend to resemble their parents. Even before the rediscovery of Mendel's work, Francis Galton was collecting detailed statistical data on resemblance between parents and offspring (Chapter 2). We will demonstrate the central ideas of the transmission of quantitative traits, using some of the concepts that Galton developed. Then we will show how models of Mendelian inheritance can account for these features of hereditary transmission. Calculation of the degree of resemblance among relatives in terms of underlying Mendelian genetics was first provided by Fisher (1918). Fisher's paper, notoriously difficult, was of great historical importance to population genetics, because it provided the first demonstration that multiple Mendelian genes could account for the observed patterns of transmission of multifactorial traits.

Figure 9.2 shows a plot of the mean of male offspring for a quantitative trait (y values) against the phenotypic value of the father (x values), displayed in the way Galton devised. The line is the best-fitting straight line, called the regression line, of offspring on parent. Regression is relevant to one of the primary aims in animal and plant breeding, namely to be able to improve attributes of the stock. An essential part of genetic improvement is to be able to predict what sort of offspring would be obtained from a given pair of parents.
For quantitative traits, prediction cannot be done exactly, but a statistical description of the most likely offspring can be obtained by the procedure of plotting the parent-offspring regression. For reasons that will become clear in a moment, we are interested in the slope of the regression line. The slope is most easily expressed in terms of the covariance of x and y, defined as

$$\mathrm{Cov}(x,y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n} = \overline{xy} - \bar{x}\,\bar{y},$$

where the bar over a symbol means the average. This quantity is the sample covariance of x and y. The slope of the line through a cluster of points having the smallest summed squared distance to the points is the regression coefficient:

$$b = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}.$$

A related quantity that also arises in quantitative genetics is the product-moment correlation coefficient, often simply referred to as the correlation:

$$r = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}.$$

Figure 9.2 Mean weight of male pupae of the flour beetle Tribolium castaneum, against pupal weight of father (sire). Each point is the mean of about eight male offspring. The regression coefficient of male offspring weight on sire's weight is b = 0.11, and $h^2$ is estimated as 2b. (Courtesy of F.D. Enfield.) [Horizontal axis: pupal weight of sires (micrograms).]

An important concept in statistics is the distinction between parameters and estimators. Descriptors that are calculated from a set of data to describe a sample (such as the sample mean and sample variance) are considered as estimates of the parameters that determine the true distribution. The sample is thought of as having been drawn from some perfect distribution (whose parameters we can never know), and the sample statistics give us a best guess at what that true distribution is. In statistics, the distinction is generally made by unadorned Greek symbols for parameters and circumflexes for estimates.
The usual symbols are: $\mu$ for the parametric mean, $\sigma^2$ for the variance, $\sigma_{xy}$ for the covariance of x and y, and $\rho$ for the correlation. Using the circumflex notation for estimates, $\hat{\mu}_x$ denotes the sample mean of x, so that $\hat{\mu}_x = \bar{x}$. Similarly, $\hat{\sigma}_x^2 = \mathrm{Var}(x)$ and $\hat{\sigma}_{xy} = \mathrm{Cov}(x,y)$ are the sample estimates of the variance of x and the covariance of x and y. When describing models of quantitative genetics, it is the true distributions that are of interest, and so the parameters are used. When describing the results of an experiment, it is more appropriate to use the notation for estimates.

The covariance and the correlation coefficient are convenient measures of the degree of association between x and y. If x and y are independent, then $\sigma_{xy}$ and $\rho$ are both zero. Since the covariance between any two variables measures their degree of association, the covariance may be positive or negative. Positive covariance means that values of x and y tend to increase or decrease together; negative covariance means that, as one variable increases, the other tends to decrease. The limiting values of the covariance are $-\sigma_x\sigma_y$ on the negative side, and $\sigma_x\sigma_y$ on the positive. The limits are achieved only when the variables demonstrate a perfect linear relationship with each other.

Returning now to Figure 9.2, if Cov(x,y) represents the covariance between phenotypic values of fathers (sires) and those of their male offspring, and Var(x) represents the variance of phenotypic values of the fathers, then the slope of the regression line is equal to the regression coefficient, Cov(x,y)/Var(x), which can be seen as follows. Suppose that the equation of the line is represented as

$$y = c + bx \tag{9.1}$$

where c and b are constants, b being the slope.
Taking means of both sides yields

$$\bar{y} = c + b\bar{x} \tag{9.2}$$

Subtracting the second equation from the first yields

$$y - \bar{y} = (c + bx) - (c + b\bar{x}) = b(x - \bar{x}) \tag{9.3}$$

Now multiply through by $x - \bar{x}$ to obtain

$$(x - \bar{x})(y - \bar{y}) = b(x - \bar{x})^2 \tag{9.4}$$

Taking means of both sides produces

$$\mathrm{Cov}(x,y) = b\,\mathrm{Var}(x) \tag{9.5}$$

In other words, the slope b of the regression line equals

$$b = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} \tag{9.6}$$

As noted, the slope is called the regression coefficient of offspring on one parent.

A graphical interpretation of regression is illustrated in Figure 9.3, which shows the distribution, in two dimensions, of the variables x and y. The variables may represent, for example, the phenotypic values of parents (x) and offspring (y). When there is no association between x and y, the distribution is a random scatter of points, and any line through the points fits equally badly. Figure 9.3 shows the appearance of the scatter of points for different values of association between the two variables. Note that, while each parameter measures an aspect of association between x and y, the covariance, the regression coefficient, and the correlation coefficient are different things. For example, the covariance and the regression coefficient are unbounded, whereas the correlation coefficient must be between -1 and 1.

Figure 9.3 Plots of random scatters of points having the same variance on the x axis but a range of covariances. With zero covariance (top), the regression coefficient is zero. A stronger linear trend results in a higher regression coefficient.

Two extreme examples may help clarify parent-offspring regression. At one extreme, if there were no genetic contribution to the trait, then the scattergram might appear as a random scatter as in the top panel of Figure 9.3 with no tendency to follow a line. In such a case, knowing the phenotype of the parents would not help to predict that of the offspring, because there would be no parent-offspring resemblance.
On the other hand, even with no genetic variation, the points might nevertheless show a substantial tendency to follow a line. To see why this is so, consider families living in different environments. In favorable environments with plenty of food and resources, parents and offspring might all be big and strong, while in unfavorable environments, parents and offspring might be small and sickly. A parent-offspring plot would show that big strong parents have big strong offspring, while small sickly parents have small sickly offspring, even though there is absolutely no genetic basis for the trait. The tendency of points to follow a line in a parent-offspring scattergram tells us nothing about the genetic basis of the trait, unless we are willing to make some claims (which hopefully can be tested experimentally) about the environmental covariance (the tendency of parents and offspring to resemble one another due to shared environments). Only if there is no environmental covariance will the parent-offspring regression indicate a degree of genetic influence on the resemblance. The possibility of environmental covariance is absolutely critical in human quantitative genetics, where the influence of shared environments can be very subtle and very strong.

Assuming now that the environmental covariance is zero, the regression coefficient b of offspring on one parent can be calculated for any random-mating population, and it indicates the degree to which the variance in the trait is determined by genetic variation. It is for this reason that the regression coefficient is related to an important quantity in quantitative genetics called heritability.
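
The warning about environmental covariance can be made concrete with a small simulation. The sketch below is an illustration only, not an analysis from the text. It assumes a hypothetical population in which the trait has no genetic basis at all: each family shares one normally distributed environmental effect, and the parent and the offspring each add independent noise. The function names, the number of families, and the standard deviations are all assumptions chosen for the example. The covariance, regression coefficient, and correlation follow the sample definitions given earlier (divisor n).

```python
import random
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance with divisor n: mean(xy) - mean(x) * mean(y)."""
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)


def regression_coefficient(x, y):
    """Slope of y on x: b = Cov(x, y) / Var(x)."""
    return covariance(x, y) / covariance(x, x)


def correlation(x, y):
    """Product-moment correlation: r = Cov(x, y) / sqrt(Var(x) Var(y))."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


def simulate_families(n_families, env_sd, noise_sd, seed=0):
    """Hypothetical model with NO genetic effects: phenotype equals a
    shared family environment plus independent individual noise."""
    rng = random.Random(seed)
    parents, offspring = [], []
    for _ in range(n_families):
        family_env = rng.gauss(0.0, env_sd)  # shared by parent and offspring
        parents.append(family_env + rng.gauss(0.0, noise_sd))
        offspring.append(family_env + rng.gauss(0.0, noise_sd))
    return parents, offspring


# Shared environment present: a clear slope appears without any genetics.
p_shared, o_shared = simulate_families(5000, env_sd=2.0, noise_sd=1.0)
b_shared = regression_coefficient(p_shared, o_shared)
r_shared = correlation(p_shared, o_shared)

# No shared environment: the slope should be close to zero.
p_none, o_none = simulate_families(5000, env_sd=0.0, noise_sd=1.0)
b_none = regression_coefficient(p_none, o_none)

print(f"slope with shared environment:    {b_shared:.3f}")
print(f"slope without shared environment: {b_none:.3f}")
```

Under these assumptions the expected slope is env_sd^2 / (env_sd^2 + noise_sd^2), which is 4/5 for the values used here, even though nothing is inherited. A naive reading of such a slope as evidence of genetic influence would therefore be badly misleading.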
There are two types of heritability that will be distinguished shortly, but for now, we note that the "narrow-sense" heritability ($h^2$) can be estimated from the relationship

$$b = \tfrac{1}{2}h^2 \tag{9.7}$$

The 1/2 occurs in Equation 9.7 because the regression involves only a single parent (the father, in the case of Figure 9.2), and only half of the genes from any one parent are passed on to the offspring. In Figure 9.2, b = 0.11, so $h^2$ = 0.22. Notice the considerable scatter among the points in the figure, which represents data from 32 families. Because this sort of scatter is typical, heritability estimates tend to be quite imprecise unless based on data from several hundred families. Note, however, that even with an enormous sample, there would be no less scatter to the points; we would merely have a more accurate measure of how much scatter there is. One further point about Figure 9.2: in organisms such as mammals, the regression is better performed on the father's phenotype, rather than on the mother's, in order to avoid potential bias in the estimate of heritability caused by such maternal effects as intrauterine environment. In organisms where nurturing does not impart significant maternal effects, scattergrams can be constructed with the x axis being the average of the two parents (the midparent) and the y axis the offspring phenotypes. From this sort of plot the regression coefficient is equal to the heritability: in symbols, when the x axis is the midparent, $b = h^2$.

PROBLEM 9.1 This example of calculating $h^2$ from parent-offspring regression uses data from Cook (1965), who studied shell breadth in 119 sibships of the snail Arianta arbustorum. For computational convenience, the data have been grouped into six categories. Estimate the heritability of shell breadth from these data.

Number of sibships   Midparent value (mm)   Offspring mean (mm)
22                   16.25                  17.73
31                   18.75                  19.15
48                   21.25                  20.73
11                   23.75                  22.84
4                    26.25                  23.75
3                    28.75                  25.42

ANSWER Letting x refer to the midparent value and y refer to the offspring mean, and weighting each category by its number of sibships, $\bar{x} = 20.2626$, $\bar{y} = 20.1786$, $\sum x^2 = 49{,}823.4375$, $\sum xy = 49{,}267.1875$, $\hat{\sigma}_{xy} \approx 5.14$, $\hat{\sigma}_x^2 \approx 8.11$, and $h^2 = b = 0.63$. (In actual practice we might not want to group the data into categories, because there is some loss of accuracy from grouping. The regression coefficient for the ungrouped data is b = 0.70. In addition, it should be noted that there is substantial assortative mating for shell breadth, and so the heritability estimate is artificially large.)

To this point we have shown that heritability can be used to measure the degree of resemblance between parents and offspring. Although the definition of heritability in terms of the regression coefficient between midparents and offspring is reasonable, heritability defined in this manner is merely a descriptive, empirical quantity because it makes no assumptions about genetics. In the next section we show how heritability in this purely statistical sense can be used to predict the result of artificial selection.

ARTIFICIAL SELECTION AND REALIZED HERITABILITY

The deliberate choice of a select group of individuals to be used for breeding constitutes artificial selection. The most common type of artificial selection is directional selection, in which phenotypically superior animals or plants are chosen for breeding. Although artificial selection has been practiced successfully for thousands of years (for example, in the body size of domesticated dogs), only during this century have the genetic principles underlying its successes become clear.
Understanding the genetic principles of artificial selection permits prediction of the rapidity and amount by which a population can be altered through artificial selection in any particular generation or small number of generations. The theory of artificial selection is also strongly motivated by the idea that natural selection may operate in a similar way. For example, if only those individuals with greater than a certain amount of body fat survive, or only those individuals with less than a critical rate of evaporative water loss survive, then natural selection acts on the distribution of phenotypes in much the same way that breeders select characters of agricultural importance.

Artificial selection in outcrossing, genetically heterogeneous populations is usually successful in that the mean phenotype of the population changes over generations in the direction of selection (provided the population has not previously been subjected to long-term artificial selection for the trait in question). In experimental animals, the mean of almost any quantitative trait can be altered in whatever direction desired by artificial selection. For example, in Drosophila, body size, wing size, bristle number, growth rate, egg production, insecticide resistance, and many other traits can be increased or decreased by selection. In domesticated animals and plants, birth weight, growth rate, milk production, egg production, grain yield, and countless other traits respond to selection. Figure 9.4 shows the results of a long-term selection program involving oil content in corn. Amazingly, the line selected for high oil content is still responding after more than 90 generations (Dudley and Lambert 1992). The general success of artificial selection in outcrossing species indicates that a wealth of genetic variation affecting quantitative traits exists.
On the other hand, in a genetically uniform population, the mean phenotype of the population cannot usually be changed through artificial selection, because genetic variation is required for progress under artificial selection. For example, in experiments with the Princess bean, Johannsen (1909) found that artificial selection consistently resulted in failure when practiced within essentially homozygous lines. He obtained this result because, in genetically homozygous populations, the only source of genetic variation comes from new mutations. In contrast, since genetically variable populations usually respond to artificial selection, and genetically uniform populations do not respond, the response to artificial selection might be used as a measure of the extent of genetic variation in the trait. This notion of selection response reflecting genetic variation will be formalized in the next section.

Figure 9.4 Results of a famous long-term experiment selecting for high and low oil content in corn seeds. Begun in 1896, the experiment has the longest duration of any on record and still continues at the University of Illinois. Note the steady, linear rise in oil content shown by the upper curve. The lower curve started on a roughly linear path and continued so for about ten generations, but then the response tapered off, presumably because zero percent oil is an absolute lower limit for the trait. (After Dudley and Lambert 1992.) [Horizontal axis: generation.]

Prediction Equation for Individual Selection

When individuals are selected for breeding based solely on their own individual phenotypic values, the type of artificial selection is called individual selection. Figure 9.5 illustrates a variety of individual selection called truncation selection.
The curve in panel A represents the normal distribution of a quantitative trait in a population, and the shaded part of the distribution to the right of the phenotypic value denoted T indicates those individuals selected for breeding. The value T is called the truncation point. The mean phenotype in the entire population is denoted $\mu$, and that of the selected parents is denoted $\mu_s$. When the selected parents are mated at random, their offspring have the phenotypic distribution shown in panel B, where the mean phenotype is denoted $\mu'$.

Figure 9.5 Diagram of truncation selection. (A) Distribution of phenotypes in the parental population, mean $\mu$. Individuals with phenotypes above the truncation point (T) are saved for breeding the next generation. The selected parents are denoted by the shading and their mean phenotype by $\mu_s$. (B) The mean of the distribution of phenotypes in the progeny is denoted $\mu'$. Note that $\mu'$ is greater than $\mu$ but less than $\mu_s$. The quantity S is called the selection differential, and R is called the response to selection. [Panel annotation: $S = \mu_s - \mu$.]

An example of truncation selection for seed weight in edible beans is shown in Figure 9.6. In this example, T = 650 mg, $\mu$ = 403.5 mg, $\mu_s$ = 691.7 mg, and $\mu'$ = 609.1 mg. In this case, as is typical of truncation selection, the offspring mean $\mu'$ is greater than the previous population mean $\mu$ but less than the parental mean $\mu_s$. The reason $\mu'$ is greater than $\mu$ is that some of the selected parents have favorable genotypes and therefore pass favorable genes on to their offspring. At the same time, $\mu'$ is generally less than $\mu_s$ for two reasons:

1. Because some of the selected parents do not have favorable genotypes; rather, their exceptional phenotypes result from chance exposure to exceptionally favorable environments.
2. Because alleles, not genotypes, are transmitted to the offspring, and exceptionally favorable genotypes are disrupted by Mendelian segregation and recombination.

Figure 9.6 Truncation selection experiment for seed weight in edible beans of the genus Phaseolus, laid out as in Figure 9.5. The truncation point (T) is 650 mg. The selection differential S is the difference in means between the selected parents and the whole population. The response R is the difference in means between the progeny generation and the entire population in the previous generation. The quantity R/S is the realized heritability. (Data from Johannsen 1903.) [Panel annotations: $S = \mu_s - \mu = 691.7 - 403.5 = 288.2$; $R = \mu' - \mu = 609.1 - 403.5 = 205.6$. Horizontal axis: weight of seed (milligrams).]

The difference in mean phenotype between the selected parents and the entire parental population is the selection differential and is designated S. In symbols,

$$S = \mu_s - \mu \tag{9.8}$$

The difference in mean phenotype between the progeny generation and the previous generation is the response to selection and is designated R. Symbolically,

$$R = \mu' - \mu \tag{9.9}$$

In quantitative genetics, any equation that defines the relationship between the selection differential S and the response to selection R is known as a prediction equation. Since selection can be applied to a population in many different ways (others will be discussed later in this chapter), the prediction equation may differ corresponding to the different modes of selection. A general prediction equation that applies to many forms of selection, including truncation selection (the type of selection illustrated in Figure 9.5), is

$$R = h^2 S \tag{9.10}$$

where $h^2$ is the realized heritability. Later in this chapter, we will show that the realized heritability is identical to the narrow-sense heritability defined by regression, provided the phenotypes and the magnitudes of genetic effects follow a bell-shaped Gaussian distribution.
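
Equations 9.8 through 9.10 amount to simple arithmetic, sketched below with the bean seed-weight means quoted for Figure 9.6. The function names are illustrative assumptions; only the three means come from the text. The final lines use the realized heritability to predict the response to a hypothetical selection differential of 100 mg, which assumes that the same heritability would apply to that round of selection.

```python
def selection_differential(mean_selected, mean_population):
    """Equation 9.8: S = mu_s - mu."""
    return mean_selected - mean_population


def selection_response(mean_progeny, mean_population):
    """Equation 9.9: R = mu' - mu."""
    return mean_progeny - mean_population


def realized_heritability(response, differential):
    """Equation 9.10 rearranged: h^2 = R / S."""
    return response / differential


def predicted_response(h2, differential):
    """Equation 9.10: R = h^2 * S."""
    return h2 * differential


# Means (in mg) quoted in the text for the bean experiment of Figure 9.6.
mu = 403.5        # whole parental population
mu_s = 691.7      # selected parents
mu_prime = 609.1  # progeny of the selected parents

S = selection_differential(mu_s, mu)
R = selection_response(mu_prime, mu)
h2 = realized_heritability(R, S)

print(f"S = {S:.1f} mg, R = {R:.1f} mg, realized h^2 = {h2:.3f}")

# Hypothetical follow-up: response expected from S = 100 mg,
# assuming the realized heritability stays the same.
hypothetical_S = 100.0
print(f"predicted R = {predicted_response(h2, hypothetical_S):.1f} mg")
```

Keeping S, R, and h^2 as separate steps mirrors the way the quantities are defined: the realized heritability is simply a summary of what selection achieved, not an independent measurement.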
These distributional assumptions are necessary in order to apply regression to the problem. This equivalence emphasizes again that heritability can be understood at several different levels. Equation 9.10 implies that the realized heritability of a trait can be interpreted as a mere description of what happens when artificial selection is practiced. In Figure 9.6, for example, S = 288.2 and R = 205.6, so $h^2$ = R/S = 205.6/288.2 = 71.3%. When estimated like this from empirical data, $h^2$ is the realized heritability, and it simply summarizes the observed result.

PROBLEM 9.2 Below are data on the number i of sternital bristles in samples from two consecutive generations $G_1$ and $G_2$ of an experiment in directional selection for increased bristle number. In the $G_1$ generation, individuals with 22 or more bristles (enclosed in brackets) were mated together at random to form the $G_2$ generation. Estimate the realized heritability of the number of sternital bristles in this experiment. (Data kindly provided by Trudy Mackay. In order to make the sexes comparable, the value of 2 has been added to the bristle number in males.)

15 2 20 20 13 25 HI 3 16 21 4 21 12 14 26 2 17 3 7 ■ 22 [131 12 27 IS ia 16 23 13] 6 28 2 19 17 17 24 , i .,,„. [51 3

ANSWER The means are $\mu$ = 2220/115 = 19.3, $\mu_s$ = 22.7, and $\mu'$ = 2035/101 = 20.1. The selection differential S = 22.7 - 19.3 = 3.4 (Equation 9.8) and the response R = 20.1 - 19.3 = 0.8 (Equation 9.9). The realized heritability estimated from Equation 9.10 is $h^2$ = 0.8/3.4 = 0.235.

Data from experiments by Mackay (1985) demonstrate the potential significance of new mutations in quantitative genetics. The base population on which selection was performed was created by a cross that mobilizes the transposable element P that results in new P-element insertions in the germline and a syndrome of partial infertility and other reproductive abnormalities known as hybrid dysgenesis.
As a control, a genetically identical base population was formed by the reciprocal cross, in which the P element is not mobilized and hybrid dysgenesis does not occur. In the dysgenic cross, the realized heritability in abdominal bristle number was increased by 40% as compared with the nondysgenic control. More strikingly, the phenotypic variance of bristle number in the selected dysgenic lines increased by a factor of three over the course of eight generations. These results demonstrate that the genetic variation affecting quantitative traits may even include insertions of transposable elements. On the other hand, other comparable experiments using hybrid dysgenesis have not given such dramatic results.

Selection Limits

Progress under artificial selection does not continue forever. Any population must eventually reach a selection limit, or plateau, after which it no longer responds to selection. One of the reasons that a population eventually reaches a plateau is exhaustion of genetic variance, such that all alleles affecting the selected trait have become fixed or lost, or are otherwise unavailable for selection. With no genetic variance, no progress under individual selection can be achieved. However, many experimental populations that have reached a selection limit readily respond to reverse selection (selection in the reverse direction of that originally applied), so genetic variance affecting the trait is still present. Indeed, in such populations, the phenotype may change in the direction of its original value if continuing artificial selection is simply suspended (relaxed selection). The consequences of relaxed selection for one example in Drosophila are illustrated in Figure 9.7.

One frequent reason for the occurrence of selection limits in populations with considerable genetic variation is that artificial selection is opposed by natural selection.
In mice, for example, response to selection for small body size ultimately ceases because small animals are less fertile than larger ones, and the smallest animals are sterile (Falconer and Mackay 1996). Selection for small body size gradually becomes less effective due to the opposing effects of natural selection until, eventually, no further progress is possible. When selection is relaxed, the natural selection is unopposed and results in a retrogression in the artificially selected trait. Some backward slippage with relaxed selection also results from diminution in the linkage disequilibrium that usually builds up during the course of long-term artificial selection. If natural selection opposes the artificial selection, then when artificial selection is relaxed, natural selection results in at least a partial return to the initial phenotypic mean.

Figure 9.7 Response to selection for wind tunnel flight speed in Drosophila melanogaster, plotted against generation. One line was maintained without selection for 30 generations starting at generation 65, and another was maintained without selection for 10 generations starting at generation 85 (triangles). In these examples, the flight performance did not degrade after selection was relaxed. Apparently the selection response occurred with little correlated response on fitness. (After Weber 1996.)

TABLE 9.1 SELECTION LIMITS AND DURATION OF RESPONSE FOR VARIOUS TRAITS IN LABORATORY MICE

Character selected      Direction of selection   Total response^a   Half-life of response^b
Weight (in strain N)    Up                       3.4 $\sigma_p$     0.6N
                        Down                     5.6 $\sigma_p$     0.6N
Weight (in strain Q)    Up                       3.9 $\sigma_p$     0.2N
                        Down                     3.6 $\sigma_p$     0.4N
Growth rate             Up                       2.0 $\sigma_p$     0.3N
                        Down                     4.5 $\sigma_p$     0.5N
Litter size             Up                       1.2 $\sigma_p$     0.5N
                        Down                     0.5 $\sigma_p$     0.5N

Source: From Falconer 1977.
^a Total response is expressed as a multiple of the initial phenotypic standard deviation, $\sigma_p$.
^b Half-life of response is the number of generations taken to progress halfway to the selection limit; here the half-life is expressed in multiples of effective population number ($N$).

In most genetically heterogeneous populations, artificial selection can change the phenotype well beyond the range of variation found in the original population. Pertinent data for populations of mice are presented in Table 9.1. As can be seen, a total selection response of three to five times the original phenotypic standard deviation is not unusual, and for selection to change a population of effective size $N$ halfway to its selection limit typically requires about $\tfrac{1}{2}N$ generations.

In some cases the total response to artificial selection is very large. For example, in a long-term selection experiment for pupal weight in Tribolium, in which the base population consisted of the progeny of a cross between two inbred lines, 100 generations of selection resulted in a population in which the mean pupal weight was 17 standard deviation units greater than the mean in the base population (Enfield 1980). The ability to select a population in which virtually every phenotype is greater than the maximum in the original population strikes many students as paradoxical. It does seem plausible to argue that, if all of the alleles eventually selected are already present in the original population, then all possible favorable genotypes should be present also, though perhaps at low frequency. The fallacy in the argument is that real populations subjected to artificial selection are actually small in size, consisting of at most a few hundred organisms. Therefore, if the favored alleles are rare, then the frequency of the favored genotypes may be so small that the expected number of such genotypes will be much smaller than one, and so the superior genotypes, while theoretically possible, do not actually exist in the original population.
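The argument about population size can be made concrete with a short calculation. The sketch below is only illustrative: the function name and the numbers (10 loci, favored allele at frequency 1/4, 500 individuals) are hypothetical, and it assumes unlinked loci in Hardy-Weinberg proportions and linkage equilibrium.

```python
from fractions import Fraction

def expected_best_genotypes(n_loci: int, p: Fraction, pop_size: int) -> Fraction:
    """Expected number of individuals homozygous for the favored allele at
    every one of n_loci independent loci (Hardy-Weinberg, linkage equilibrium)."""
    freq = (p * p) ** n_loci   # p^2 per locus, multiplied across loci
    return pop_size * freq

# Hypothetical values, not taken from the text.
expected = expected_best_genotypes(10, Fraction(1, 4), 500)
print(float(expected))   # on the order of 1e-10, far below one individual
```

Even though every favored allele is present in many copies, the best multilocus genotype is expected to be absent until selection has raised the allele frequencies.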
Some traits consistently fail to respond to artificial selection, suggesting a lack of suitable genetic variation. Bilateral symmetry is an example of a trait that has not been amenable to change by artificial selection. The failure of Maynard-Smith and Sondhi (1961) to create bilateral asymmetry in Drosophila by selecting for an excess of dorsal bristles on the left side is typical. The apparent lack of genetic variation determining bilateral asymmetry is of interest in regard to embryonic development, for it implies that the genetic control of development of symmetrical structures specifies patterns that are common to the left and the right sides of the body. That is, rather than left-bristle genes and right-bristle genes, there appear to be generic bristle genes whose spatial expression is determined symmetrically. Of course, asymmetrical structures do exist (such as the vertebrate heart), and recently inroads have been made in understanding the molecular genetic basis for this asymmetry (Isaac et al. 1997). Genes that affect left-right asymmetry do not do so in a continuous manner; rather, they either successfully establish the asymmetry or they do not; absence of symmetry is fatal.

Not all traits with heritable variation obey the prediction equation and show a simple linear change in the mean. Sometimes a trait responds to directional selection for a few generations, then ceases to respond, but later responds again as selection is continued. One possible mechanism for this stop-and-start response is that the population at a plateau is in linkage disequilibrium, and it takes time for recombination to break up the allelic associations and release the latent genetic variation. This phenomenon was observed in a long-term study of the quantitative genetics of wing veins in Drosophila (Scharloo 1987).
In this case a bimodal phenotypic distribution was also generated during selection (Figure 9.8), which was proposed to reflect a nonlinear mapping from genetic and environmental factors to the determination of phenotype.

As we have seen, heritability can be interpreted in purely statistical terms with no genetic content. However, if we postulate that there are Mendelian genes underlying the phenotypes, then the genetic underpinning allows us to do more than merely describe statistical relations among individuals. By bringing Mendelian genetics into the picture, we will see why the response to any kind of artificial selection is determined by the magnitude of the heritability. In particular, the genetic basis of response to artificial selection comes from changes in gene frequencies and sometimes also from changes in linkage disequilibrium.

Figure 9.8 Frequency distributions in females (left) and males (right) of a line of Drosophila melanogaster selected for fourth wing vein length. The horizontal axis is the length of the fourth wing vein as a percentage of the third wing vein. The broken lines represent selection for a short vein, and solid lines represent selection for a long vein. In the line selected for long veins, both sexes displayed a bimodal frequency distribution when the relative vein length was approximately 60-80%. (From Scharloo 1987.)

GENETIC MODELS FOR QUANTITATIVE TRAITS

When $h^2$ is interpreted as realized heritability, then Equation 9.10 is hardly a "prediction equation" inasmuch as it merely describes what has already happened in one generation of selection. Of course, the equation could be used to predict the result of the next generation of selection, but artificial selection is impossible in many natural populations and is time consuming and expensive in many domesticated plants and animals. It would therefore be useful if one could estimate heritability without actually performing any artificial selection.
If the heritability $h^2$ could be estimated in such a manner, then Equation 9.10 would be a true prediction equation in the sense that the response $R$ could be predicted for any selection differential $S$, based on the estimated value of $h^2$. Such an estimate of $h^2$ is indeed possible, but it involves an understanding of heritability at a level that includes the underlying genetic basis of quantitative traits.

An understanding of the genetics behind Equation 9.10 requires three items: (1) a concept of how alternative alleles of a gene affect a quantitative trait; (2) a determination of how selection changes the allele frequencies; and (3) a calculation of how much the mean of the trait increases as a result of the change in allele frequency. Some detail is required to establish these three items, but the detail is necessary in order to understand the genetic meaning of heritability.

Nilsson-Ehle (1909) was the first to show that a trait with a nearly continuous distribution of phenotypes could result from the joint effects of several genes. The trait of interest is the intensity of red pigment in the glume of wheat, Triticum vulgare, which Nilsson-Ehle found to result from three unlinked genes, each with two alleles. The situation is exceptionally simple for a quantitative trait because the environment has a negligible effect on phenotype, because the alleles of each gene are additive (i.e., heterozygotes have a phenotype that is exactly intermediate between homozygous phenotypes), and because the genetic effects are also additive across genes (i.e., the total genetic effect of any three-locus genotype is just the sum of the separate effects of each gene). To simplify matters, consider just two of the genes, and let their alleles be denoted (A, a) and (B, b). With additivity within and across genes, we may assume that the genotype aabb has a color score of 0 (white) and that each A or B allele in the genotype contributes one unit of red pigment.
Figure 9.9A shows the nine possible two-gene genotypes, their frequencies with random mating when the allele frequencies of A and B are both $1/2$, and the color score of each genotype assuming additivity. The mean color score of the population is 2. Indeed, when the allele frequencies of A and B are both $p$, then the mean of a population with random mating can be shown to equal $4p$.

To connect this trait with the prediction equation (Equation 9.10), suppose that the two lowest phenotypic classes (i.e., 0 and 1) are selected as parents of the next generation. We calculate $\mu_s$, $\mu'$, $S$, $R$, and $h^2 = R/S$ by first finding the allele frequency of A and B among the selected parents and then using the mean $= 4p$ formula to obtain the mean of the offspring with random mating. In this example, $\mu = 4(1/2) = 2$ is given. The selected parents consist of genotypes Aabb, aaBb, and aabb with respective frequencies $2/5$, $2/5$, and $1/5$, and the mean of parents is $\mu_s = (2/5)(1) + (2/5)(1) + (1/5)(0) = 4/5$. The allele frequency of A and B among parents is $(1/2)(2/5) = 1/5$, and therefore the mean among offspring is $\mu' = 4(1/5) = 4/5$. Then $S = (4/5) - 2 = -6/5$ and $R = (4/5) - 2 = -6/5$, so $h^2 = R/S = 1.0$. As demonstrated in the next paragraph, this high heritability is due to the additivity within and across genes and not merely to the fact that environmental effects are negligible.

Figure 9.9 Frequencies of two-locus genotypes (outside circles) and respective phenotypes (within circles) in a population with allele frequency $1/2$ for each locus. Panel A illustrates the case of additivity of effects at each locus and across loci. In panel B, A and B are each dominant to a and b respectively, but the effects of the two loci are additive.

Figure 9.9B refers to a hypothetical situation in which the A and B alleles are dominant but still additive across genes.
Thus, genotypes AA, Aa, BB, and Bb each add one unit of red pigment to the phenotype. In this case, it can be shown that the mean of a random-mating population with allele frequencies of A and B both equal to $p$ is given by $2p(1 + q)$, where $q = 1 - p$. If the two lowest phenotypic classes (i.e., 0 and 1) are selected as parents of the next generation, then the mean of parents is $\mu_s = (1/7)(1) + (2/7)(1) + (1/7)(1) + (2/7)(1) + (1/7)(0) = 6/7$. The allele frequency of A and B among parents is $p = (1/7) + (1/2)(2/7) = 2/7$, and the mean of the offspring is therefore $\mu' = 2(2/7)[1 + 5/7] = 48/49$. Thus, in the case where A and B are dominant, $S = (6/7) - (3/2) = -9/14$ and $R = (48/49) - (3/2) = -51/98$, so $h^2 = 51/63 = 0.81$. Although environmental effects on seed color are still negligible in the dominance case, the heritability has become less than 1.0. This perhaps surprising result occurs because certain genetic effects (such as those resulting from dominance or, in other examples, nonadditivity across genes) are not useful in changing a population by means of the type of individual selection discussed here.

To see how an underlying genetic model can be formulated for continuous characters, refer to Figure 9.10, which shows the normal distribution of a trait in a hypothetical random mating population. In truncation selection, all individuals with phenotypes above the truncation point $T$ are saved for breeding, and the shaded area $B$ of the distribution represents the proportion of the population selected. (The total area under any normal density equals 1.) The height of the normal density at the point $T$ is denoted $Z$, and, as before, the mean phenotype among the selected individuals is called $\mu_s$.
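The two wheat-color calculations above, additive and dominant, can be reproduced with exact fractions. The sketch below is only an illustration: the function names are invented here, and it assumes, as the text does, random mating, allele frequency 1/2 at both loci, and selection of the phenotypic classes scoring 0 and 1.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)

def genotype_freqs(p):
    """Hardy-Weinberg frequencies keyed by the number of capital alleles (2, 1, 0)."""
    q = 1 - p
    return {2: p * p, 1: 2 * p * q, 0: q * q}

def additive_score(n_a, n_b):
    return n_a + n_b                    # each A or B allele adds one unit

def dominant_score(n_a, n_b):
    return min(n_a, 1) + min(n_b, 1)    # AA and Aa (or BB and Bb) each add one unit

def population_mean(p, score):
    """Mean score when both loci have allele frequency p."""
    f = genotype_freqs(p)
    return sum(f[a] * f[b] * score(a, b) for a, b in product(f, f))

def realized_h2(score, p=HALF, max_selected=1):
    """Select every genotype scoring <= max_selected, then mate at random."""
    f = genotype_freqs(p)
    selected = {(a, b): f[a] * f[b]
                for a, b in product(f, f) if score(a, b) <= max_selected}
    total = sum(selected.values())
    mu = population_mean(p, score)
    mu_s = sum(w * score(a, b) for (a, b), w in selected.items()) / total
    # Allele frequency of A among parents; by symmetry B has the same value.
    p_sel = sum(w * Fraction(a, 2) for (a, b), w in selected.items()) / total
    mu_next = population_mean(p_sel, score)
    S, R = mu_s - mu, mu_next - mu
    return S, R, R / S

additive = realized_h2(additive_score)   # S = R = -6/5, h2 = 1
dominant = realized_h2(dominant_score)   # S = -9/14, R = -51/98, h2 = 17/21 (= 51/63)
print(additive, dominant)
```

Because both scoring rules are sums of single-locus contributions, the offspring mean depends only on the allele frequencies among the parents, which is why the function needs nothing beyond `p_sel`.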
One of the special properties of the normal distribution to be used below is that

$$(\mu_s - \mu)/\sigma^2 = Z/B \tag{9.11}$$

where $\sigma^2$ is the phenotypic variance of the trait.

Figure 9.10 Normal distribution of a quantitative trait in a hypothetical population, showing some important symbols used in quantitative genetics. Here $\mu$ is the mean of the population, $T$ the truncation point, $Z$ the height (ordinate) of the normal density at the point $T$, $B$ is the shaded area under the normal curve to the right of $T$, and $\mu_s$ is the mean among selected parents.

To determine the amount of increase in mean phenotype in a population resulting from one generation of truncation selection, we first imagine a gene that affects the trait in question and that has alleles A and A' with respective allele frequencies $p$ and $q$. Because of random mating, genotypes AA, AA', and A'A' are present in the population with frequencies $p^2$, $2pq$, and $q^2$, respectively, but the individual genotypes cannot be identified through their phenotypic values because of the variation in phenotype caused by environmental factors and genetic differences in other genes. If the genotypes could be identified, their individual distributions of phenotypic value might appear as shown in Figure 9.11. Each distribution is normal and has the same variance, but the means are very slightly different. The mean phenotypes of AA, AA', and A'A' genotypes are denoted $\mu^* + a$, $\mu^* + d$, and $\mu^* - a$, respectively. The symbols $a$ and $d$ serve as convenient representations of the effects of the alleles in question on the quantitative trait. The difference between means of homozygotes is $(\mu^* + a) - (\mu^* - a) = 2a$, and $d/a$ serves as a measure of dominance. The relationship $d = a$ means that A is dominant, $d = 0$ implies additivity (heterozygotes exactly intermediate in phenotype between the homozygotes), and $d = -a$ means that A' is dominant. (Use of $a$ and $d$ in this manner simplifies some of the subsequent formulas.)
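Equation 9.11 is easy to check numerically. The sketch below is an illustration under assumed values (mean 100, standard deviation 10, truncation point 110, none of which come from the text); it takes $Z$ and $B$ from Python's `statistics.NormalDist` and obtains the mean of the selected group by simple midpoint integration.

```python
from statistics import NormalDist

def truncation_summary(mu, sigma, T, steps=20_000):
    """Return Z, B, and the mean of individuals above T for a normal trait."""
    dist = NormalDist(mu, sigma)
    Z = dist.pdf(T)          # height of the density at the truncation point
    B = 1.0 - dist.cdf(T)    # proportion of the population saved for breeding
    # Midpoint-rule integral of x * f(x) from T to a point far out in the tail.
    upper = mu + 12 * sigma
    h = (upper - T) / steps
    integral = sum((T + (i + 0.5) * h) * dist.pdf(T + (i + 0.5) * h)
                   for i in range(steps)) * h
    return Z, B, integral / B

# Hypothetical trait values chosen only for illustration.
Z, B, mu_s = truncation_summary(mu=100.0, sigma=10.0, T=110.0)
lhs = (mu_s - 100.0) / 10.0 ** 2   # (mu_s - mu) / sigma^2
rhs = Z / B
print(lhs, rhs)                    # the two sides agree to several decimal places
```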
Calculation of $a$ and $d$ for an actual example involving two alleles that affect coat coloration in guinea pigs is illustrated in Table 9.2. In this case, $a = 0.127$, $d = -0.016$ (the negative sign on $d$ means that the $c^d$ allele is partially dominant), and $d/a = -0.126$.

Figure 9.11 Same distribution as in Figure 9.10, showing the slightly different distribution of phenotypic value among the three genotypes (AA, AA', and A'A') for a gene with two alleles that contributes to the quantitative trait. The means of the distributions of AA, AA', and A'A' are symbolized $\mu^* + a$, $\mu^* + d$, and $\mu^* - a$, respectively.

TABLE 9.2 CALCULATION OF $\mu^*$, $a$, AND $d$ FOR ALLELES AT A LOCUS AFFECTING COAT COLORATION IN GUINEA PIGS^a

Genotype              Amount of black coloration^b
$c^r c^r$ (AA)        1.202 = $\mu^* + a$ = 1.075 + 0.127
$c^r c^d$ (AA')       1.059 = $\mu^* + d$ = 1.075 - 0.016
$c^d c^d$ (A'A')      0.948 = $\mu^* - a$ = 1.075 - 0.127

$\mu^* = (1.202 + 0.948)/2 = 1.075$
$a = 1.202 - 1.075 = 0.127$
$d = 1.059 - 1.075 = -0.016$

Source: Data from Wright 1968.
^a The calculations to be carried out first are those beneath the data; then the right-hand column is completed.
^b Here the amount of black coloration is measured as $\arcsin(\sqrt{x})$, where $x$ is the percentage of black coloration on the animal. For $c^r c^r$, $c^r c^d$, and $c^d c^d$ genotypes, the corresponding $x$ values are 87%, 76%, and 66%, respectively.

Assuming Hardy-Weinberg genotype frequencies, the mean phenotype in the entire population is

$$\mu = p^2(\mu^* + a) + 2pq(\mu^* + d) + q^2(\mu^* - a) \tag{9.12}$$

PROBLEM 9.3 Crosses between the Danmark ($P_1$) and Red Currant ($P_2$) tomato gave the following mean fruit weights and their log transforms. $P_1$ and $P_2$ are the parental means, $F_1$ and $F_2$ are the first and second hybrid generation, and $B_1$ and $B_2$ are the progeny of the backcross of $F_1 \times P_1$ and $F_1 \times P_2$, respectively.
        Expected mean               Mean weight       Log (weight)
$P_1$   $\mu + a$                   10.36 ± 0.581     0.98 ± 0.03
$P_2$   $\mu - a$                   0.45 ± 0.017      -0.36 ± 0.02
$F_1$   $\mu + d$                   2.33 ± 0.130      0.33 ± 0.03
$F_2$   $\mu + \tfrac{1}{2}d$       2.12 ± 0.105      0.27 ± 0.01
$B_1$   $\mu + \tfrac{1}{2}(a + d)$ 4.82 ± 0.253      0.64 ± 0.02
$B_2$   $\mu + \tfrac{1}{2}(d - a)$ 0.97 ± 0.045      -0.05 ± [illegible]

Use this information to calculate $\mu$, $a$, and $d$ for both the weights and the log transformed weights. Do the simple weights or the log transformed weights fit the model better? (Data from Powers 1951.)

ANSWER The difference between the two parental means is $2a$, so $a = (10.36 - 0.45)/2 = 4.96$. This gives $\mu = 5.4$. The $F_1$ has a mean $\mu + d = 2.33$, so $d = 2.33 - 5.4 = -3.07$. The $F_2$ should have mean $\tfrac{1}{4}(\mu + a) + \tfrac{1}{2}(\mu + d) + \tfrac{1}{4}(\mu - a) = \mu + \tfrac{1}{2}d = 5.4 + \tfrac{1}{2}(-3.07) = 3.86$. The $B_1$ refers to backcrosses of the $F_1$ to $P_1$, which should yield one-half of genotypes like $P_1$ and one-half of genotypes like the $F_1$, so the mean should be $\tfrac{1}{2}(\mu + a) + \tfrac{1}{2}(\mu + d) = \mu + \tfrac{1}{2}(a + d) = 6.34$. Similar reasoning gives an expected mean for $B_2$ of 1.38. The estimates for the means of the $F_2$, $B_1$, and $B_2$ do not fit very well at all. Trying again with the log transformed data, we get $a = 0.67$, $\mu = 0.31$, and $d = 0.02$. The expected means for the $F_2$, $B_1$, and $B_2$ are then $0.31 + \tfrac{1}{2}(0.02) = 0.32$, $0.31 + \tfrac{1}{2}(0.67 + 0.02) = 0.65$, and $0.31 + \tfrac{1}{2}(0.02 - 0.67) = -0.01$. The log transformed data clearly fit much better, suggesting that the better scale to use for the quantitative genetic models is the log transformed scale. In actual practice, the entire set of data is used to estimate $\mu$, $a$, and $d$ by a method known as least squares, and the goodness of fit of the model to the data can be tested by a chi-square test.

Effects of the scale of measurement are known as scaling effects. For example, the $a$ and $d$ values in Table 9.2 are different when calculated for the percent of black coloration $x$ or for the $\arcsin(\sqrt{x})$ tabulated values.
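The scaling effect in the guinea pig example can be seen by computing $\mu^*$, $a$, and $d$ on both scales. The sketch below is illustrative: the helper name `mid_a_d` is invented here, the percentages are the $x$ values quoted in the footnote to Table 9.2, and the arcsine is taken in radians, which is the convention that reproduces the tabulated values.

```python
import math

def mid_a_d(m_AA, m_AAp, m_ApAp):
    """Midpoint mu*, additive effect a, and dominance deviation d
    from the mean phenotypes of AA, AA', and A'A'."""
    mu_star = (m_AA + m_ApAp) / 2
    return mu_star, m_AA - mu_star, m_AAp - mu_star

percent = (87.0, 76.0, 66.0)   # percent black coloration, Table 9.2 footnote
arcsine = tuple(math.asin(math.sqrt(x / 100)) for x in percent)   # radians

results = {"percent": mid_a_d(*percent), "arcsin sqrt": mid_a_d(*arcsine)}
for label, (mu_star, a, d) in results.items():
    print(f"{label:12s} mu*={mu_star:.3f}  a={a:.3f}  d={d:.3f}  d/a={d / a:.3f}")
```

On the percent scale the degree of dominance $d/a$ is roughly $-0.05$, whereas on the arcsine scale it is roughly $-0.13$, so even the apparent degree of dominance depends on the scale chosen.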
Since estimates of the additive and dominance values of alleles depend on scaling, so does the heritability. An important point is that the equivalence between the heritability defined by parent-offspring regression and by realized heritability depends on the correct choice of scaling. Only one scaling provides a normal Gaussian distribution of phenotypes and of genetic effects, and that is the appropriate scaling that yields the prediction Equation 9.10.

Change in Gene Frequency

Suppose for the moment that we were practicing artificial selection for increased amount of black coat coloration in the guinea pigs in Table 9.2. Selection for black-coat coloration in a population containing both the $c^r$ (i.e., A) and $c^d$ (i.e., A') alleles would be successful in increasing the allele frequency of A, and the average amount of black coloration among individuals of the next generation would increase. Therefore, in order to calculate the expected increase in black coloration in one generation of selection, we must first calculate the corresponding change in the allele frequency of A.

An equation for change in allele frequency with natural selection was derived in Chapter 6, which remains valid for artificial selection if we agree to interpret the "fitness" of an individual as the probability that the individual is included among the group selected as parents of the next generation. With this interpretation of fitness, differences in fitness (i.e., reproductive success) of AA, AA', and A'A' genotypes correspond to the differences in area to the right of the truncation point in Figure 9.11, because only those individuals in the shaded area are allowed to reproduce. The differences in area are easy to calculate if you shift or slide each curve horizontally until its mean coincides with $\mu^*$. The A'A' curve must slide $a$ units to the right, and the AA' and AA curves must slide $d$ and $a$ units to the left.
This shifting brings the distributions into coincidence, but it slides the truncation points slightly out of register, as shown in Figure 9.12. The difference in "fitness" between AA and AA', denoted $w_{11} - w_{12}$ (as in Chapter 6), is equal to the small area indicated in Figure 9.12, as is the difference in fitness between AA' and A'A', denoted $w_{12} - w_{22}$. The areas corresponding to $w_{11} - w_{12}$ and $w_{12} - w_{22}$ are approximately rectangles, and the area of a rectangle is the product of the base and the height. The approximation is most accurate when the effect of this one locus on the phenotype is small. Therefore, since $Z$ represents the height of the normal distribution at the point $T$, we can make the following approximations:

$$w_{11} - w_{12} \approx Z[(T - d) - (T - a)] = Z(a - d)$$
$$w_{12} - w_{22} \approx Z[(T + a) - (T - d)] = Z(a + d) \tag{9.13}$$

The average fitness $\bar{w}$ of the entire population simply equals $B$, because $B$ is the proportion of the population saved for breeding. From Chapter 6 we know that

$$\Delta p = pq[p(w_{11} - w_{12}) + q(w_{12} - w_{22})]/\bar{w}$$

where $\Delta p$ is the change in frequency of the allele A in one generation of selection. Substituting from Equation 9.13 and using $\bar{w} = B$ leads to

$$\Delta p = pq[pZ(a - d) + qZ(a + d)]/B \tag{9.14}$$

or, since $p + q = 1$,

$$\Delta p = (Z/B)pq[a + (q - p)d] \tag{9.15}$$

An equation corresponding to 9.15 could be obtained for any gene affecting the trait, but the values of $p$, $a$, and $d$ would differ for each gene. The quantity in square brackets in Equation 9.15 is called the average excess. A generalization that accounts for nonrandom mating is found in Falconer (1985).

Figure 9.12 Same distribution as in Figures 9.10 and 9.11, but with the distributions of AA, AA', and A'A' shifted laterally to coincide.
Shifting the distributions slides the truncation points slightly out of register, so the truncation points for AA, AA', and A'A' become $T - a$, $T - d$, and $T + a$, respectively. The small area that is denoted $w_{11} - w_{12}$ is the difference between the proportions of AA and AA' genotypes that are included among the selected parents, and the area $w_{12} - w_{22}$ is the difference in the proportion of AA' and A'A' genotypes included among the selected parents.

Genetic Model for the Change in Mean Phenotype

Equation 9.15 provides an expression for $\Delta p$ which can be used to calculate the mean phenotypic value of coat color after one generation of selection. In the next generation, the allele frequencies of A and A' are $p + \Delta p$ and $q - \Delta p$, respectively. With random mating, the mean phenotype in this generation is given by Equation 9.12 as

$$\mu' = (p + \Delta p)^2(\mu^* + a) + 2(p + \Delta p)(q - \Delta p)(\mu^* + d) + (q - \Delta p)^2(\mu^* - a) \tag{9.16}$$

When the right-hand side of this expression is multiplied out and terms in $(\Delta p)^2$ are ignored because $\Delta p$ is usually small, then $\mu'$ is found to be approximately

$$\mu' \approx \mu + 2[a + (q - p)d]\Delta p \tag{9.17}$$

The approximation in Equation 9.17 is rather good even for relatively large values of $\Delta p$.

Equation 9.17 warrants a little more development since it yields the prediction equation $R = h^2 S$ (Equation 9.10) and also provides an expression for $h^2$ in terms of the parameters $a$, $d$, and $p$ that can be interpreted genetically.
First, rewrite Equation 9.17 as

$$\mu' - \mu = 2[a + (q - p)d]\Delta p \tag{9.18}$$

Then substitute for $\Delta p$ from Equation 9.15, which yields

$$\mu' - \mu = (Z/B)\,2pq[a + (q - p)d]^2 \tag{9.19}$$

Now use the expression for $Z/B$ given in Equation 9.11 to obtain

$$\mu' - \mu = (\mu_s - \mu)\,2pq[a + (q - p)d]^2/\sigma^2 \tag{9.20}$$

Finally, substitute from Equations 9.8 and 9.9 for the selection differential $S$ and the response $R$, yielding

$$R = (S)\,2pq[a + (q - p)d]^2/\sigma^2 \tag{9.21}$$

However, $R = h^2 S$ also (Equation 9.10), and so

$$h^2 = 2pq[a + (q - p)d]^2/\sigma^2 \tag{9.22}$$

Equation 9.22 for $h^2$ is the one we were after, as it defines the heritability in terms of $p$, $q$, $a$, and $d$, each of which has a genetic meaning. Equation 9.22 is a valid approximation when a single gene affects the trait in question, and when the effects of that gene are small. However, when many genes affect the trait, the right-hand side of the equation must be replaced by a summation of such terms, one for each gene. That is, for many genes,

$$R = h^2 S \quad \text{where} \quad h^2 = \sum 2pq[a + (q - p)d]^2/\sigma^2 \tag{9.23}$$

in which the summation is over all genes that affect the trait. (However, each gene may have different values of $a$, $d$, $p$, and $q$.) As will be discussed in more detail later, the quantity

$$\sigma_a^2 = \sum 2pq[a + (q - p)d]^2 \tag{9.24}$$

is called the additive genetic variance of the trait. Although the individual components in the additive genetic variance are difficult to identify except in contrived examples like the one involving guinea pigs, the collective effects (represented by the summation) can be estimated.

COMPONENTS OF PHENOTYPIC VARIANCE

As Equation 9.24 suggests, the variance of a quantitative trait can be split into various components representing different causes of variation. Similarity between relatives is conveniently expressed in terms of the variance components, but variance partitioning is also of interest in its own right.
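As a numerical check on the chain from Equation 9.15 to Equation 9.22, the sketch below compares the approximations with an exact single-locus truncation-selection calculation. Everything in it is an assumption made for illustration: the values of `a`, `d`, `p`, the within-genotype standard deviation `sigma_w`, the truncation point `T`, and the choice $\mu^* = 0$ are hypothetical, and the whole population is treated as normal when computing $Z$ and $B$.

```python
from statistics import NormalDist

# Hypothetical single-locus values (not from the text); effects are small
# relative to the within-genotype standard deviation.
a, d, p = 0.10, 0.05, 0.30
q = 1.0 - p
sigma_w = 1.0     # standard deviation within each genotype
T = 1.0           # truncation point, with mu* taken as 0

freqs = (p * p, 2 * p * q, q * q)   # AA, AA', A'A'
means = (a, d, -a)                  # genotype means relative to mu*

def mean_from_p(p_):
    """Equation 9.12 with mu* = 0."""
    q_ = 1.0 - p_
    return p_ * p_ * a + 2 * p_ * q_ * d - q_ * q_ * a

std = NormalDist()
mu = mean_from_p(p)
var_g = sum(f * (m - mu) ** 2 for f, m in zip(freqs, means))
var_p = sigma_w ** 2 + var_g        # total phenotypic variance sigma^2

# Exact truncation selection: "fitness" is the chance of exceeding T.
w = [1.0 - std.cdf((T - m) / sigma_w) for m in means]
w_bar = sum(f * wi for f, wi in zip(freqs, w))
dp_exact = p * q * (p * (w[0] - w[1]) + q * (w[1] - w[2])) / w_bar

# Equation 9.15, with Z and B taken from a normal curve of variance var_p.
whole = NormalDist(mu, var_p ** 0.5)
Z, B = whole.pdf(T), 1.0 - whole.cdf(T)
bracket = a + (q - p) * d
dp_approx = (Z / B) * p * q * bracket

# Exact S and R, compared with R = h^2 S using h^2 from Equation 9.22.
mu_s = sum(f * (m * wi + sigma_w * std.pdf((T - m) / sigma_w))
           for f, m, wi in zip(freqs, means, w)) / w_bar
S = mu_s - mu
R_exact = mean_from_p(p + dp_exact) - mu
h2 = 2 * p * q * bracket ** 2 / var_p

print(dp_exact, dp_approx)   # close, but not identical
print(R_exact, h2 * S)       # close, but not identical
```

With effects this small the approximate and exact values differ by only a few percent at most, which is the sense in which Equation 9.22 is "a valid approximation when the effects of that gene are small."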
Since the rate of change of a trait under selection depends on the amount of genetic variation affecting the trait, if there is no genetic variation, there is obviously no response to selection. What is not so obvious is that some components of genetic variation cannot be acted upon by some kinds of selection. In other words, certain populations have ample genetic variation, yet fail to respond to selection. The part of the genetic variation amenable to selection is clarified by partitioning the variance.

Genetic and Environmental Sources of Variation

As shown in Table 9.3, the phenotypic value of any individual can be represented as a sum of three components: (1) the mean $\mu$ of the entire population, (2) a deviation from the population mean due to the specific genotype of the individual in question (symbolized as $G_1$, $G_2$, and $G_3$ for AA, AA', and A'A' genotypes, respectively), and (3) a deviation from the population mean due to the specific microenvironment of the individual in question. (The environmental deviations are unique to each individual and are represented as $E_1, E_2, \ldots, E_9$.) These microenvironmental effects might be due to random differences in nutrition, temperature, or other external factors, or they might be seen even in an absolutely uniform external environment due to the vagaries of embryonic development. It is important to note that the $G$s and $E$s are not directly observable. Nevertheless, as we shall see, the total variance in phenotypic value can be partitioned into a component due to variation among the $G$s and another component due to variation among the $E$s.
The model can be summarized by writing

$$P = \mu + G + E \tag{9.25}$$

where $P$ represents the phenotypic value of any individual and $G$ and $E$ are the genotypic and environmental deviations pertaining to that individual.

TABLE 9.3 PHENOTYPES OF VARIOUS GENOTYPES AS THE SUM OF $\mu$, $G$, AND $E$^a

Genotype    Phenotypic value
AA          $\mu + G_1 + E_1$
AA          $\mu + G_1 + E_2$
AA          $\mu + G_1 + E_3$
AA'         $\mu + G_2 + E_4$
AA'         $\mu + G_2 + E_5$
AA'         $\mu + G_2 + E_6$
A'A'        $\mu + G_3 + E_7$
A'A'        $\mu + G_3 + E_8$
A'A'        $\mu + G_3 + E_9$

^a $\mu$ is the population mean. $G$ is a contribution due to genotype, different for each genotype. $E$ is a contribution due to environment, different for each individual.

To connect the above symbols with actual numbers, we may use Table 9.2 and assume an allele frequency of A of $p = 0.2$. Equation 9.12 then implies that the mean of the population is $\mu = 0.994$. Thus, the respective $G_1$, $G_2$, and $G_3$ deviations for AA, AA', and A'A' genotypes are

$G_1 = 1.202 - 0.994 = 0.208$
$G_2 = 1.059 - 0.994 = 0.065$
$G_3 = 0.948 - 0.994 = -0.046$

For a particular animal of genotype AA whose actual coat color score is, for example, 1.312, the corresponding value of $E$ for the animal would be calculated using Equation 9.25 from the expression $1.312 = 0.994 + 0.208 + E$; thus, for this animal, $E = 0.11$. Similarly, a particular animal of genotype AA' with an actual phenotype of $P = 1.009$ would have a value of $E$ given by $1.009 = 0.994 + 0.065 + E$, or $E = -0.05$. Because the $E$ values are defined as deviations from their mean, the average of $E$s for any genotype is 0. Likewise, since the $G$s are defined as deviations from their mean, the mean of the $G$s is 0.
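The guinea pig numbers can be reproduced in a few lines. This is a minimal sketch using the genotype means of Table 9.2 and the allele frequency $p = 0.2$ assumed in the text; the variable and function names are chosen here for illustration only.

```python
# Genotype means from Table 9.2 (arcsine scale) and the allele frequency of A
# assumed in the text.
means = {"AA": 1.202, "AA'": 1.059, "A'A'": 0.948}
p = 0.2
q = 1 - p
freqs = {"AA": p * p, "AA'": 2 * p * q, "A'A'": q * q}

mu = sum(freqs[g] * means[g] for g in means)    # Equation 9.12
G = {g: means[g] - mu for g in means}           # genotypic deviations
mean_G = sum(freqs[g] * G[g] for g in means)    # frequency-weighted mean of the Gs

def env_deviation(phenotype, genotype):
    """E = P - mu - G, rearranged from Equation 9.25."""
    return phenotype - mu - G[genotype]

print(round(mu, 3), {g: round(v, 3) for g, v in G.items()})
print(round(env_deviation(1.312, "AA"), 3), round(env_deviation(1.009, "AA'"), 3))
print(abs(mean_G) < 1e-12)   # the weighted mean of the Gs is zero up to rounding
```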
This result can be verified in the guinea pig example because

$$(0.2)^2 G_1 + 2(0.2)(0.8) G_2 + (0.8)^2 G_3 = 0$$

Equation 9.25 is appropriate when the effects of genotype and environment are additive, that is, when the deviation of the phenotype of any particular individual from the population mean ($P - \mu$) can be written as the sum of an effect resulting from the genotype of that individual and a separate effect resulting from the environment of that individual.

PROBLEM 9.4 In Problem 9.3 the values of $\mu$, $a$, and $d$ were found to be 0.31, 0.67, and 0.02, respectively, for the logarithms of tomato weight. Calculate the additive genetic variance in the $F_2$ population, the $B_1$ population, and the $B_2$ population.

ANSWER In the $F_2$ population the allele frequency is $p = q = 1/2$, so the formula for the additive genetic variance (Equation 9.24) is $\sigma_a^2 = 2pqa^2 = \tfrac{1}{2}a^2 = 0.224$. In the backcross 1 population $B_1$, the allele frequencies are $p = 3/4$ and $q = 1/4$, so, applying Equation 9.24, we get $\sigma_a^2 = 0.173$. The backcross 2 population $B_2$ has allele frequencies $p = 1/4$ and $q = 3/4$, so $\sigma_a^2 = 0.163$. When the dominance parameter is so small, the additive variance is at a maximum when the allele frequencies are both $1/2$, and the graph of additive variance against allele frequency is symmetric.

To this point the discussion has been restricted to a particular population in a single macroenvironment, and the sources of variation have been due to genetic and microenvironmental differences among individuals. A change in macroenvironment is easiest to see in an experimental setting, where, for example, in macroenvironment 1 all of the guinea pigs get twice as much food as in macroenvironment 2. Additivity of genetic and environmental effects holds whenever the ratio $G_1 : G_2 : G_3$ is the same in each of the relevant environments.
Figure 9.13 The norm of reaction is the relation between the phenotype and the environment, and this relation is known to vary from genotype to genotype. Hypothetical norms of reaction for genotypes AA, AA', and A'A' are shown here, with environment on the horizontal axis. In the range of environments denoted $E_1$, A is very nearly dominant to A' (that is, AA and AA' have nearly the same phenotype). However, in the range $E_2$, A and A' are very nearly additive (no dominance). The heritability of the trait resulting from this gene differs according to whether the population is reared in $E_1$ environments or $E_2$ environments.

For the genotypes in Figure 9.13, for example, if the actual range of environments is the range designated $E_1$, then the genetic and environmental effects are additive because the ratio $G_1 : G_2 : G_3$ is the same for any particular environment in $E_1$. For the same reason, the genetic and environmental effects are additive if the actual range of environments is $E_2$. However, if the actual range of environments includes both $E_1$ and $E_2$, then the ratio $G_1 : G_2 : G_3$ depends on the particular environment, and therefore the genetic and environmental effects may not be additive. Nonadditivity of genetic and environmental